author | cinap_lenrek <cinap_lenrek@localhost> | 2011-05-03 11:25:13 +0000 |
---|---|---|
committer | cinap_lenrek <cinap_lenrek@localhost> | 2011-05-03 11:25:13 +0000 |
commit | 458120dd40db6b4df55a4e96b650e16798ef06a0 (patch) | |
tree | 8f82685be24fef97e715c6f5ca4c68d34d5074ee /sys/src/cmd/python/Doc/lib/libre.tex | |
parent | 3a742c699f6806c1145aea5149bf15de15a0afd7 (diff) |
add hg and python
Diffstat (limited to 'sys/src/cmd/python/Doc/lib/libre.tex')
-rw-r--r-- | sys/src/cmd/python/Doc/lib/libre.tex | 954 |
1 files changed, 954 insertions, 0 deletions
diff --git a/sys/src/cmd/python/Doc/lib/libre.tex b/sys/src/cmd/python/Doc/lib/libre.tex new file mode 100644 index 000000000..84e382d0f --- /dev/null +++ b/sys/src/cmd/python/Doc/lib/libre.tex @@ -0,0 +1,954 @@ +\section{\module{re} --- + Regular expression operations} +\declaremodule{standard}{re} +\moduleauthor{Fredrik Lundh}{fredrik@pythonware.com} +\sectionauthor{Andrew M. Kuchling}{amk@amk.ca} + + +\modulesynopsis{Regular expression search and match operations with a + Perl-style expression syntax.} + + +This module provides regular expression matching operations similar to +those found in Perl. Regular expression pattern strings may not +contain null bytes, but can specify the null byte using the +\code{\e\var{number}} notation. Both patterns and strings to be +searched can be Unicode strings as well as 8-bit strings. The +\module{re} module is always available. + +Regular expressions use the backslash character (\character{\e}) to +indicate special forms or to allow special characters to be used +without invoking their special meaning. This collides with Python's +usage of the same character for the same purpose in string literals; +for example, to match a literal backslash, one might have to write +\code{'\e\e\e\e'} as the pattern string, because the regular expression +must be \samp{\e\e}, and each backslash must be expressed as +\samp{\e\e} inside a regular Python string literal. + +The solution is to use Python's raw string notation for regular +expression patterns; backslashes are not handled in any special way in +a string literal prefixed with \character{r}. So \code{r"\e n"} is a +two-character string containing \character{\e} and \character{n}, +while \code{"\e n"} is a one-character string containing a newline. +Usually patterns will be expressed in Python code using this raw +string notation. + +\begin{seealso} + \seetitle{Mastering Regular Expressions}{Book on regular expressions + by Jeffrey Friedl, published by O'Reilly. 
The second + edition of the book no longer covers Python at all, + but the first edition covered writing good regular expression + patterns in great detail.} +\end{seealso} + + +\subsection{Regular Expression Syntax \label{re-syntax}} + +A regular expression (or RE) specifies a set of strings that matches +it; the functions in this module let you check if a particular string +matches a given regular expression (or if a given regular expression +matches a particular string, which comes down to the same thing). + +Regular expressions can be concatenated to form new regular +expressions; if \emph{A} and \emph{B} are both regular expressions, +then \emph{AB} is also a regular expression. In general, if a string +\emph{p} matches \emph{A} and another string \emph{q} matches \emph{B}, +the string \emph{pq} will match AB. This holds unless \emph{A} or +\emph{B} contain low precedence operations; boundary conditions between +\emph{A} and \emph{B}; or have numbered group references. Thus, complex +expressions can easily be constructed from simpler primitive +expressions like the ones described here. For details of the theory +and implementation of regular expressions, consult the Friedl book +referenced above, or almost any textbook about compiler construction. + +A brief explanation of the format of regular expressions follows. For +further information and a gentler presentation, consult the Regular +Expression HOWTO, accessible from \url{http://www.python.org/doc/howto/}. + +Regular expressions can contain both special and ordinary characters. +Most ordinary characters, like \character{A}, \character{a}, or +\character{0}, are the simplest regular expressions; they simply match +themselves. You can concatenate ordinary characters, so \regexp{last} +matches the string \code{'last'}. (In the rest of this section, we'll +write RE's in \regexp{this special style}, usually without quotes, and +strings to be matched \code{'in single quotes'}.) 
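The concatenation rule above can be tried directly. This is a minimal sketch, assuming a current Python 3 interpreter; the names \code{first}, \code{second} and \code{combined} are illustrative only and do not come from the module.

```python
import re

# Two simple REs made of ordinary characters; each matches itself.
first, second = r"la", r"st"

# Concatenating the pattern strings gives the RE "last".
combined = re.compile(first + second)

whole = combined.match("last")    # 'la' followed by 'st' matches
partial = combined.match("lass")  # second half differs, so no match

print(whole.group(0))
print(partial)
```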
+ +Some characters, like \character{|} or \character{(}, are special. +Special characters either stand for classes of ordinary characters, or +affect how the regular expressions around them are interpreted. + +The special characters are: +% +\begin{description} + +\item[\character{.}] (Dot.) In the default mode, this matches any +character except a newline. If the \constant{DOTALL} flag has been +specified, this matches any character including a newline. + +\item[\character{\textasciicircum}] (Caret.) Matches the start of the +string, and in \constant{MULTILINE} mode also matches immediately +after each newline. + +\item[\character{\$}] Matches the end of the string or just before the +newline at the end of the string, and in \constant{MULTILINE} mode +also matches before a newline. \regexp{foo} matches both 'foo' and +'foobar', while the regular expression \regexp{foo\$} matches only +'foo'. More interestingly, searching for \regexp{foo.\$} in +'foo1\textbackslash nfoo2\textbackslash n' matches 'foo2' normally, +but 'foo1' in \constant{MULTILINE} mode. + +\item[\character{*}] Causes the resulting RE to +match 0 or more repetitions of the preceding RE, as many repetitions +as are possible. \regexp{ab*} will +match 'a', 'ab', or 'a' followed by any number of 'b's. + +\item[\character{+}] Causes the +resulting RE to match 1 or more repetitions of the preceding RE. +\regexp{ab+} will match 'a' followed by any non-zero number of 'b's; it +will not match just 'a'. + +\item[\character{?}] Causes the resulting RE to +match 0 or 1 repetitions of the preceding RE. \regexp{ab?} will +match either 'a' or 'ab'. + +\item[\code{*?}, \code{+?}, \code{??}] The \character{*}, +\character{+}, and \character{?} qualifiers are all \dfn{greedy}; they +match as much text as possible. Sometimes this behaviour isn't +desired; if the RE \regexp{<.*>} is matched against +\code{'<H1>title</H1>'}, it will match the entire string, and not just +\code{'<H1>'}. 
Adding \character{?} after the qualifier makes it +perform the match in \dfn{non-greedy} or \dfn{minimal} fashion; as +\emph{few} characters as possible will be matched. Using \regexp{.*?} +in the previous expression will match only \code{'<H1>'}. + +\item[\code{\{\var{m}\}}] +Specifies that exactly \var{m} copies of the previous RE should be +matched; fewer matches cause the entire RE not to match. For example, +\regexp{a\{6\}} will match exactly six \character{a} characters, but +not five. + +\item[\code{\{\var{m},\var{n}\}}] Causes the resulting RE to match from +\var{m} to \var{n} repetitions of the preceding RE, attempting to +match as many repetitions as possible. For example, \regexp{a\{3,5\}} +will match from 3 to 5 \character{a} characters. Omitting \var{m} +specifies a lower bound of zero, +and omitting \var{n} specifies an infinite upper bound. As an +example, \regexp{a\{4,\}b} will match \code{aaaab} or a thousand +\character{a} characters followed by a \code{b}, but not \code{aaab}. +The comma may not be omitted or the modifier would be confused with +the previously described form. + +\item[\code{\{\var{m},\var{n}\}?}] Causes the resulting RE to +match from \var{m} to \var{n} repetitions of the preceding RE, +attempting to match as \emph{few} repetitions as possible. This is +the non-greedy version of the previous qualifier. For example, on the +6-character string \code{'aaaaaa'}, \regexp{a\{3,5\}} will match 5 +\character{a} characters, while \regexp{a\{3,5\}?} will only match 3 +characters. + +\item[\character{\e}] Either escapes special characters (permitting +you to match characters like \character{*}, \character{?}, and so +forth), or signals a special sequence; special sequences are discussed +below. 
+ +If you're not using a raw string to +express the pattern, remember that Python also uses the +backslash as an escape sequence in string literals; if the escape +sequence isn't recognized by Python's parser, the backslash and +subsequent character are included in the resulting string. However, +if Python would recognize the resulting sequence, the backslash should +be repeated twice. This is complicated and hard to understand, so +it's highly recommended that you use raw strings for all but the +simplest expressions. + +\item[\code{[]}] Used to indicate a set of characters. Characters can +be listed individually, or a range of characters can be indicated by +giving two characters and separating them by a \character{-}. Special +characters are not active inside sets. For example, \regexp{[akm\$]} +will match any of the characters \character{a}, \character{k}, +\character{m}, or \character{\$}; \regexp{[a-z]} +will match any lowercase letter, and \code{[a-zA-Z0-9]} matches any +letter or digit. Character classes such as \code{\e w} or \code{\e S} +(defined below) are also acceptable inside a range. If you want to +include a \character{]} or a \character{-} inside a set, precede it with a +backslash, or place it as the first character. The +pattern \regexp{[]]} will match \code{']'}, for example. + +You can match the characters not within a range by \dfn{complementing} +the set. This is indicated by including a +\character{\textasciicircum} as the first character of the set; +\character{\textasciicircum} elsewhere will simply match the +\character{\textasciicircum} character. For example, +\regexp{[{\textasciicircum}5]} will match +any character except \character{5}, and +\regexp{[\textasciicircum\code{\textasciicircum}]} will match any character +except \character{\textasciicircum}. + +\item[\character{|}]\code{A|B}, where A and B can be arbitrary REs, +creates a regular expression that will match either A or B. 
An +arbitrary number of REs can be separated by the \character{|} in this +way. This can be used inside groups (see below) as well. As the target +string is scanned, REs separated by \character{|} are tried from left to +right. When one pattern completely matches, that branch is accepted. +This means that once \code{A} matches, \code{B} will not be tested further, +even if it would produce a longer overall match. In other words, the +\character{|} operator is never greedy. To match a literal \character{|}, +use \regexp{\e|}, or enclose it inside a character class, as in \regexp{[|]}. + +\item[\code{(...)}] Matches whatever regular expression is inside the +parentheses, and indicates the start and end of a group; the contents +of a group can be retrieved after a match has been performed, and can +be matched later in the string with the \regexp{\e \var{number}} special +sequence, described below. To match the literals \character{(} or +\character{)}, use \regexp{\e(} or \regexp{\e)}, or enclose them +inside a character class: \regexp{[(] [)]}. + +\item[\code{(?...)}] This is an extension notation (a \character{?} +following a \character{(} is not meaningful otherwise). The first +character after the \character{?} +determines what the meaning and further syntax of the construct is. +Extensions usually do not create a new group; +\regexp{(?P<\var{name}>...)} is the only exception to this rule. +Following are the currently supported extensions. + +\item[\code{(?iLmsux)}] (One or more letters from the set \character{i}, +\character{L}, \character{m}, \character{s}, \character{u}, +\character{x}.) The group matches the empty string; the letters set +the corresponding flags (\constant{re.I}, \constant{re.L}, +\constant{re.M}, \constant{re.S}, \constant{re.U}, \constant{re.X}) +for the entire regular expression. This is useful if you wish to +include the flags as part of the regular expression, instead of +passing a \var{flag} argument to the \function{compile()} function. 
+ +Note that the \regexp{(?x)} flag changes how the expression is parsed. +It should be used first in the expression string, or after one or more +whitespace characters. If there are non-whitespace characters before +the flag, the results are undefined. + +\item[\code{(?:...)}] A non-grouping version of regular parentheses. +Matches whatever regular expression is inside the parentheses, but the +substring matched by the +group \emph{cannot} be retrieved after performing a match or +referenced later in the pattern. + +\item[\code{(?P<\var{name}>...)}] Similar to regular parentheses, but +the substring matched by the group is accessible via the symbolic group +name \var{name}. Group names must be valid Python identifiers, and +each group name must be defined only once within a regular expression. A +symbolic group is also a numbered group, just as if the group were not +named. So the group named 'id' in the example below can also be +referenced as the numbered group 1. + +For example, if the pattern is +\regexp{(?P<id>[a-zA-Z_]\e w*)}, the group can be referenced by its +name in arguments to methods of match objects, such as +\code{m.group('id')} or \code{m.end('id')}, and also by name in +pattern text (for example, \regexp{(?P=id)}) and replacement text +(such as \code{\e g<id>}). + +\item[\code{(?P=\var{name})}] Matches whatever text was matched by the +earlier group named \var{name}. + +\item[\code{(?\#...)}] A comment; the contents of the parentheses are +simply ignored. + +\item[\code{(?=...)}] Matches if \regexp{...} matches next, but doesn't +consume any of the string. This is called a lookahead assertion. For +example, \regexp{Isaac (?=Asimov)} will match \code{'Isaac~'} only if it's +followed by \code{'Asimov'}. + +\item[\code{(?!...)}] Matches if \regexp{...} doesn't match next. This +is a negative lookahead assertion. For example, +\regexp{Isaac (?!Asimov)} will match \code{'Isaac~'} only if it's \emph{not} +followed by \code{'Asimov'}. 
+ +\item[\code{(?<=...)}] Matches if the current position in the string +is preceded by a match for \regexp{...} that ends at the current +position. This is called a \dfn{positive lookbehind assertion}. +\regexp{(?<=abc)def} will find a match in \samp{abcdef}, since the +lookbehind will back up 3 characters and check if the contained +pattern matches. The contained pattern must only match strings of +some fixed length, meaning that \regexp{abc} or \regexp{a|b} are +allowed, but \regexp{a*} and \regexp{a\{3,4\}} are not. Note that +patterns which start with positive lookbehind assertions will never +match at the beginning of the string being searched; you will most +likely want to use the \function{search()} function rather than the +\function{match()} function: + +\begin{verbatim} +>>> import re +>>> m = re.search('(?<=abc)def', 'abcdef') +>>> m.group(0) +'def' +\end{verbatim} + +This example looks for a word following a hyphen: + +\begin{verbatim} +>>> m = re.search('(?<=-)\w+', 'spam-egg') +>>> m.group(0) +'egg' +\end{verbatim} + +\item[\code{(?<!...)}] Matches if the current position in the string +is not preceded by a match for \regexp{...}. This is called a +\dfn{negative lookbehind assertion}. Similar to positive lookbehind +assertions, the contained pattern must only match strings of some +fixed length. Patterns which start with negative lookbehind +assertions may match at the beginning of the string being searched. + +\item[\code{(?(\var{id/name})yes-pattern|no-pattern)}] Will try to match +with \regexp{yes-pattern} if the group with given \var{id} or \var{name} +exists, and with \regexp{no-pattern} if it doesn't. \regexp{|no-pattern} +is optional and can be omitted. For example, +\regexp{(<)?(\e w+@\e w+(?:\e .\e w+)+)(?(1)>)} is a poor email matching +pattern, which will match with \code{'<user@host.com>'} as well as +\code{'user@host.com'}, but not with \code{'<user@host.com'}. 
+\versionadded{2.4} + +\end{description} + +The special sequences consist of \character{\e} and a character from the +list below. If the ordinary character is not on the list, then the +resulting RE will match the second character. For example, +\regexp{\e\$} matches the character \character{\$}. +% +\begin{description} + +\item[\code{\e \var{number}}] Matches the contents of the group of the +same number. Groups are numbered starting from 1. For example, +\regexp{(.+) \e 1} matches \code{'the the'} or \code{'55 55'}, but not +\code{'the end'} (note +the space after the group). This special sequence can only be used to +match one of the first 99 groups. If the first digit of \var{number} +is 0, or \var{number} is 3 octal digits long, it will not be interpreted +as a group match, but as the character with octal value \var{number}. +Inside the \character{[} and \character{]} of a character class, all numeric +escapes are treated as characters. + +\item[\code{\e A}] Matches only at the start of the string. + +\item[\code{\e b}] Matches the empty string, but only at the +beginning or end of a word. A word is defined as a sequence of +alphanumeric or underscore characters, so the end of a word is indicated by +whitespace or a non-alphanumeric, non-underscore character. Note that +{}\code{\e b} is defined as the boundary between \code{\e w} and \code{\e +W}, so the precise set of characters deemed to be alphanumeric depends on the +values of the \code{UNICODE} and \code{LOCALE} flags. Inside a character +range, \regexp{\e b} represents the backspace character, for compatibility +with Python's string literals. + +\item[\code{\e B}] Matches the empty string, but only when it is \emph{not} +at the beginning or end of a word. This is just the opposite of {}\code{\e +b}, so is also subject to the settings of \code{LOCALE} and \code{UNICODE}. 
+ +\item[\code{\e d}]When the \constant{UNICODE} flag is not specified, matches +any decimal digit; this is equivalent to the set \regexp{[0-9]}. +With \constant{UNICODE}, it will match whatever is classified as a digit +in the Unicode character properties database. + +\item[\code{\e D}]When the \constant{UNICODE} flag is not specified, matches +any non-digit character; this is equivalent to the set +\regexp{[{\textasciicircum}0-9]}. With \constant{UNICODE}, it will match +anything other than characters marked as digits in the Unicode character +properties database. + +\item[\code{\e s}]When the \constant{LOCALE} and \constant{UNICODE} +flags are not specified, matches any whitespace character; this is +equivalent to the set \regexp{[ \e t\e n\e r\e f\e v]}. +With \constant{LOCALE}, it will match this set plus whatever characters +are defined as space for the current locale. If \constant{UNICODE} is set, +this will match the characters \regexp{[ \e t\e n\e r\e f\e v]} plus +whatever is classified as space in the Unicode character properties +database. + +\item[\code{\e S}]When the \constant{LOCALE} and \constant{UNICODE} +flags are not specified, matches any non-whitespace character; this is +equivalent to the set \regexp{[\textasciicircum\ \e t\e n\e r\e f\e v]}. +With \constant{LOCALE}, it will match any character not in this set, +and not defined as space in the current locale. If \constant{UNICODE} +is set, this will match anything other than \regexp{[ \e t\e n\e r\e f\e v]} +and characters marked as space in the Unicode character properties database. + +\item[\code{\e w}]When the \constant{LOCALE} and \constant{UNICODE} +flags are not specified, matches any alphanumeric character and the +underscore; this is equivalent to the set +\regexp{[a-zA-Z0-9_]}. With \constant{LOCALE}, it will match the set +\regexp{[0-9_]} plus whatever characters are defined as alphanumeric for +the current locale. 
If \constant{UNICODE} is set, this will match the +characters \regexp{[0-9_]} plus whatever is classified as alphanumeric +in the Unicode character properties database. + +\item[\code{\e W}]When the \constant{LOCALE} and \constant{UNICODE} +flags are not specified, matches any non-alphanumeric character; this +is equivalent to the set \regexp{[{\textasciicircum}a-zA-Z0-9_]}. With +\constant{LOCALE}, it will match any character not in the set +\regexp{[0-9_]}, and not defined as alphanumeric for the current locale. +If \constant{UNICODE} is set, this will match anything other than +\regexp{[0-9_]} and characters marked as alphanumeric in the Unicode +character properties database. + +\item[\code{\e Z}]Matches only at the end of the string. + +\end{description} + +Most of the standard escapes supported by Python string literals are +also accepted by the regular expression parser: + +\begin{verbatim} +\a \b \f \n +\r \t \v \x +\\ +\end{verbatim} + +Octal escapes are included in a limited form: If the first digit is a +0, or if there are three octal digits, it is considered an octal +escape. Otherwise, it is a group reference. As for string literals, +octal escapes are always at most three digits in length. + + +% Note the lack of a period in the section title; it causes problems +% with readers of the GNU info version. See http://www.python.org/sf/581414. +\subsection{Matching vs Searching \label{matching-searching}} +\sectionauthor{Fred L. Drake, Jr.}{fdrake@acm.org} + +Python offers two different primitive operations based on regular +expressions: match and search. If you are accustomed to Perl's +semantics, the search operation is what you're looking for. See the +\function{search()} function and corresponding method of compiled +regular expression objects. 
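The difference between the two primitives can be sketched in a few lines. This assumes a current Python 3 interpreter; the pattern and the sample string are illustrative only.

```python
import re

pattern = re.compile(r"b")

# match() only tries the pattern at the start of the string.
match_result = pattern.match("abc")

# search() scans forward and finds the first location that matches.
search_result = pattern.search("abc")

print(match_result)           # None, since 'b' is not at index 0
print(search_result.start())  # index where 'b' was found
```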
+ +Note that match may differ from search using a regular expression +beginning with \character{\textasciicircum}: +\character{\textasciicircum} matches only at the +start of the string, or in \constant{MULTILINE} mode also immediately +following a newline. The ``match'' operation succeeds only if the +pattern matches at the start of the string regardless of mode, or at +the starting position given by the optional \var{pos} argument +regardless of whether a newline precedes it. + +% Examples from Tim Peters: +\begin{verbatim} +re.compile("a").match("ba", 1) # succeeds +re.compile("^a").search("ba", 1) # fails; 'a' not at start +re.compile("^a").search("\na", 1) # fails; 'a' not at start +re.compile("^a", re.M).search("\na", 1) # succeeds +re.compile("^a", re.M).search("ba", 1) # fails; no preceding \n +\end{verbatim} + + +\subsection{Module Contents} +\nodename{Contents of Module re} + +The module defines several functions, constants, and an exception. Some of the +functions are simplified versions of the full featured methods for compiled +regular expressions. Most non-trivial applications always use the compiled +form. + +\begin{funcdesc}{compile}{pattern\optional{, flags}} + Compile a regular expression pattern into a regular expression + object, which can be used for matching using its \function{match()} and + \function{search()} methods, described below. + + The expression's behaviour can be modified by specifying a + \var{flags} value. Values can be any of the following variables, + combined using bitwise OR (the \code{|} operator). + +The sequence + +\begin{verbatim} +prog = re.compile(pat) +result = prog.match(str) +\end{verbatim} + +is equivalent to + +\begin{verbatim} +result = re.match(pat, str) +\end{verbatim} + +but the version using \function{compile()} is more efficient when the +expression will be used several times in a single program. 
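The equivalence described above, and the use of bitwise OR to combine flags, can be sketched as follows. This assumes a current Python 3 interpreter; the pattern \code{pat} and the sample strings are hypothetical.

```python
import re

pat = r"\d+"
text = "42 apples"  # hypothetical sample input

# Compile once, then reuse the regular expression object.
prog = re.compile(pat)
compiled_result = prog.match(text)

# One-shot module-level call; same result for a single use.
direct_result = re.match(pat, text)

# Flags are combined with the | operator.
flagged = re.compile(r"^a", re.I | re.M)
flagged_result = flagged.search("b\nA")

print(compiled_result.group(0), direct_result.group(0))
print(flagged_result.group(0))
```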
+%(The compiled version of the last pattern passed to +%\function{re.match()} or \function{re.search()} is cached, so +%programs that use only a single regular expression at a time needn't +%worry about compiling regular expressions.) +\end{funcdesc} + +\begin{datadesc}{I} +\dataline{IGNORECASE} +Perform case-insensitive matching; expressions like \regexp{[A-Z]} +will match lowercase letters, too. This is not affected by the +current locale. +\end{datadesc} + +\begin{datadesc}{L} +\dataline{LOCALE} +Make \regexp{\e w}, \regexp{\e W}, \regexp{\e b}, \regexp{\e B}, +\regexp{\e s} and \regexp{\e S} dependent on the current locale. +\end{datadesc} + +\begin{datadesc}{M} +\dataline{MULTILINE} +When specified, the pattern character \character{\textasciicircum} +matches at the beginning of the string and at the beginning of each +line (immediately following each newline); and the pattern character +\character{\$} matches at the end of the string and at the end of each +line (immediately preceding each newline). By default, +\character{\textasciicircum} matches only at the beginning of the +string, and \character{\$} only at the end of the string and +immediately before the newline (if any) at the end of the string. +\end{datadesc} + +\begin{datadesc}{S} +\dataline{DOTALL} +Make the \character{.} special character match any character at all, +including a newline; without this flag, \character{.} will match +anything \emph{except} a newline. +\end{datadesc} + +\begin{datadesc}{U} +\dataline{UNICODE} +Make \regexp{\e w}, \regexp{\e W}, \regexp{\e b}, \regexp{\e B}, +\regexp{\e d}, \regexp{\e D}, \regexp{\e s} and \regexp{\e S} +dependent on the Unicode character properties database. +\versionadded{2.0} +\end{datadesc} + +\begin{datadesc}{X} +\dataline{VERBOSE} +This flag allows you to write regular expressions that look nicer. 
+Whitespace within the pattern is ignored, +except when in a character class or preceded by an unescaped +backslash, and, when a line contains a \character{\#} neither in a +character class nor preceded by an unescaped backslash, all characters +from the leftmost such \character{\#} through the end of the line are +ignored. +% XXX should add an example here +\end{datadesc} + + +\begin{funcdesc}{search}{pattern, string\optional{, flags}} + Scan through \var{string} looking for a location where the regular + expression \var{pattern} produces a match, and return a + corresponding \class{MatchObject} instance. + Return \code{None} if no + position in the string matches the pattern; note that this is + different from finding a zero-length match at some point in the string. +\end{funcdesc} + +\begin{funcdesc}{match}{pattern, string\optional{, flags}} + If zero or more characters at the beginning of \var{string} match + the regular expression \var{pattern}, return a corresponding + \class{MatchObject} instance. Return \code{None} if the string does not + match the pattern; note that this is different from a zero-length + match. + + \note{If you want to locate a match anywhere in + \var{string}, use \method{search()} instead.} +\end{funcdesc} + +\begin{funcdesc}{split}{pattern, string\optional{, maxsplit\code{ = 0}}} + Split \var{string} by the occurrences of \var{pattern}. If + capturing parentheses are used in \var{pattern}, then the text of all + groups in the pattern is also returned as part of the resulting list. + If \var{maxsplit} is nonzero, at most \var{maxsplit} splits + occur, and the remainder of the string is returned as the final + element of the list. (Incompatibility note: in the original Python + 1.5 release, \var{maxsplit} was ignored. This has been fixed in + later releases.) 
+ +\begin{verbatim} +>>> re.split('\W+', 'Words, words, words.') +['Words', 'words', 'words', ''] +>>> re.split('(\W+)', 'Words, words, words.') +['Words', ', ', 'words', ', ', 'words', '.', ''] +>>> re.split('\W+', 'Words, words, words.', 1) +['Words', 'words, words.'] +\end{verbatim} +\end{funcdesc} + +\begin{funcdesc}{findall}{pattern, string\optional{, flags}} + Return a list of all non-overlapping matches of \var{pattern} in + \var{string}. If one or more groups are present in the pattern, + return a list of groups; this will be a list of tuples if the + pattern has more than one group. Empty matches are included in the + result unless they touch the beginning of another match. + \versionadded{1.5.2} + \versionchanged[Added the optional flags argument]{2.4} +\end{funcdesc} + +\begin{funcdesc}{finditer}{pattern, string\optional{, flags}} + Return an iterator over all non-overlapping matches for the RE + \var{pattern} in \var{string}. For each match, the iterator returns + a match object. Empty matches are included in the result unless they + touch the beginning of another match. + \versionadded{2.2} + \versionchanged[Added the optional flags argument]{2.4} +\end{funcdesc} + +\begin{funcdesc}{sub}{pattern, repl, string\optional{, count}} + Return the string obtained by replacing the leftmost non-overlapping + occurrences of \var{pattern} in \var{string} by the replacement + \var{repl}. If the pattern isn't found, \var{string} is returned + unchanged. \var{repl} can be a string or a function; if it is a + string, any backslash escapes in it are processed. That is, + \samp{\e n} is converted to a single newline character, \samp{\e r} + is converted to a carriage return, and so forth. Unknown escapes such as + \samp{\e j} are left alone. Backreferences, such as \samp{\e6}, are + replaced with the substring matched by group 6 in the pattern. For + example: + +\begin{verbatim} +>>> re.sub(r'def\s+([a-zA-Z_][a-zA-Z_0-9]*)\s*\(\s*\):', +... 
r'static PyObject*\npy_\1(void)\n{', +... 'def myfunc():') +'static PyObject*\npy_myfunc(void)\n{' +\end{verbatim} + + If \var{repl} is a function, it is called for every non-overlapping + occurrence of \var{pattern}. The function takes a single match + object argument, and returns the replacement string. For example: + +\begin{verbatim} +>>> def dashrepl(matchobj): +... if matchobj.group(0) == '-': return ' ' +... else: return '-' +>>> re.sub('-{1,2}', dashrepl, 'pro----gram-files') +'pro--gram files' +\end{verbatim} + + The pattern may be a string or an RE object; if you need to specify + regular expression flags, you must use a RE object, or use embedded + modifiers in a pattern; for example, \samp{sub("(?i)b+", "x", "bbbb + BBBB")} returns \code{'x x'}. + + The optional argument \var{count} is the maximum number of pattern + occurrences to be replaced; \var{count} must be a non-negative + integer. If omitted or zero, all occurrences will be replaced. + Empty matches for the pattern are replaced only when not adjacent to + a previous match, so \samp{sub('x*', '-', 'abc')} returns + \code{'-a-b-c-'}. + + In addition to character escapes and backreferences as described + above, \samp{\e g<name>} will use the substring matched by the group + named \samp{name}, as defined by the \regexp{(?P<name>...)} syntax. + \samp{\e g<number>} uses the corresponding group number; + \samp{\e g<2>} is therefore equivalent to \samp{\e 2}, but isn't + ambiguous in a replacement such as \samp{\e g<2>0}. \samp{\e 20} + would be interpreted as a reference to group 20, not a reference to + group 2 followed by the literal character \character{0}. The + backreference \samp{\e g<0>} substitutes in the entire substring + matched by the RE. +\end{funcdesc} + +\begin{funcdesc}{subn}{pattern, repl, string\optional{, count}} + Perform the same operation as \function{sub()}, but return a tuple + \code{(\var{new_string}, \var{number_of_subs_made})}. 
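As a small sketch of the tuple returned by \function{subn()}, together with the \samp{\e g<...>} forms described for \function{sub()}, assuming a current Python 3 interpreter (the sample strings and the group name \code{word} are illustrative):

```python
import re

# subn() returns (new_string, number_of_subs_made).
new_string, count = re.subn(r"-", " ", "a-b-c")

# \g<name> refers to a group defined with (?P<name>...).
named = re.sub(r"(?P<word>\w+)", r"<\g<word>>", "hi there")

# \g<2>0 means group 2 followed by a literal '0'.
numbered = re.sub(r"(a)(b)", r"\g<2>0", "ab")

print(new_string, count)
print(named)
print(numbered)
```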
+\end{funcdesc} + +\begin{funcdesc}{escape}{string} + Return \var{string} with all non-alphanumerics backslashed; this is + useful if you want to match an arbitrary literal string that may have + regular expression metacharacters in it. +\end{funcdesc} + +\begin{excdesc}{error} + Exception raised when a string passed to one of the functions here + is not a valid regular expression (for example, it might contain + unmatched parentheses) or when some other error occurs during + compilation or matching. It is never an error if a string contains + no match for a pattern. +\end{excdesc} + + +\subsection{Regular Expression Objects \label{re-objects}} + +Compiled regular expression objects support the following methods and +attributes: + +\begin{methoddesc}[RegexObject]{match}{string\optional{, pos\optional{, + endpos}}} + If zero or more characters at the beginning of \var{string} match + this regular expression, return a corresponding + \class{MatchObject} instance. Return \code{None} if the string does not + match the pattern; note that this is different from a zero-length + match. + + \note{If you want to locate a match anywhere in + \var{string}, use \method{search()} instead.} + + The optional second parameter \var{pos} gives an index in the string + where the search is to start; it defaults to \code{0}. This is not + completely equivalent to slicing the string; the + \code{'\textasciicircum'} pattern + character matches at the real beginning of the string and at positions + just after a newline, but not necessarily at the index where the search + is to start. + + The optional parameter \var{endpos} limits how far the string will + be searched; it will be as if the string is \var{endpos} characters + long, so only the characters from \var{pos} to \code{\var{endpos} - + 1} will be searched for a match. 
If \var{endpos} is less than + \var{pos}, no match will be found. Otherwise, if \var{rx} is a + compiled regular expression object, + \code{\var{rx}.match(\var{string}, 0, 50)} is equivalent to + \code{\var{rx}.match(\var{string}[:50], 0)}. +\end{methoddesc} + +\begin{methoddesc}[RegexObject]{search}{string\optional{, pos\optional{, + endpos}}} + Scan through \var{string} looking for a location where this regular + expression produces a match, and return a + corresponding \class{MatchObject} instance. Return \code{None} if no + position in the string matches the pattern; note that this is + different from finding a zero-length match at some point in the string. + + The optional \var{pos} and \var{endpos} parameters have the same + meaning as for the \method{match()} method. +\end{methoddesc} + +\begin{methoddesc}[RegexObject]{split}{string\optional{, + maxsplit\code{ = 0}}} +Identical to the \function{split()} function, using the compiled pattern. +\end{methoddesc} + +\begin{methoddesc}[RegexObject]{findall}{string\optional{, pos\optional{, + endpos}}} +Identical to the \function{findall()} function, using the compiled pattern. +\end{methoddesc} + +\begin{methoddesc}[RegexObject]{finditer}{string\optional{, pos\optional{, + endpos}}} +Identical to the \function{finditer()} function, using the compiled pattern. +\end{methoddesc} + +\begin{methoddesc}[RegexObject]{sub}{repl, string\optional{, count\code{ = 0}}} +Identical to the \function{sub()} function, using the compiled pattern. +\end{methoddesc} + +\begin{methoddesc}[RegexObject]{subn}{repl, string\optional{, + count\code{ = 0}}} +Identical to the \function{subn()} function, using the compiled pattern. +\end{methoddesc} + + +\begin{memberdesc}[RegexObject]{flags} +The flags argument used when the RE object was compiled, or +\code{0} if no flags were provided. 
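The following minimal sketch shows how \var{pos}, \var{endpos} and the \member{flags} attribute might be used together on a compiled object. The pattern \code{rx} and the sample \code{text} are assumptions chosen for illustration.

```python
import re

# Illustrative pattern and input; both are assumptions for this sketch.
rx = re.compile(r"\d+", re.IGNORECASE)
text = "abc 123 def 456"

# match() is anchored at pos: it fails at index 0 but succeeds at index 4.
at_start = rx.match(text)
at_four = rx.match(text, 4)

# endpos makes the string behave as if it were endpos characters long,
# so passing 6 gives the same result as matching against text[:6].
clipped = rx.match(text, 4, 6)
same_as_slice = rx.match(text[:6], 4)

# search() scans forward from pos for the first matching location.
later = rx.search(text, 7)

# The flags attribute includes the flags that were passed to compile().
has_ignorecase = bool(rx.flags & re.IGNORECASE)

print(at_start, at_four.group(), clipped.group(), later.start())
```

Testing \code{rx.flags \& re.IGNORECASE} rather than comparing \code{rx.flags} with an exact value keeps the check valid even if the implementation adds other flag bits.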
+\end{memberdesc} + +\begin{memberdesc}[RegexObject]{groupindex} +A dictionary mapping any symbolic group names defined by +\regexp{(?P<\var{id}>)} to group numbers. The dictionary is empty if no +symbolic groups were used in the pattern. +\end{memberdesc} + +\begin{memberdesc}[RegexObject]{pattern} +The pattern string from which the RE object was compiled. +\end{memberdesc} + + +\subsection{Match Objects \label{match-objects}} + +\class{MatchObject} instances support the following methods and +attributes: + +\begin{methoddesc}[MatchObject]{expand}{template} + Return the string obtained by doing backslash substitution on the +template string \var{template}, as done by the \method{sub()} method. +Escapes such as \samp{\e n} are converted to the appropriate +characters, and numeric backreferences (\samp{\e 1}, \samp{\e 2}) and +named backreferences (\samp{\e g<1>}, \samp{\e g<name>}) are replaced +by the contents of the corresponding group. +\end{methoddesc} + +\begin{methoddesc}[MatchObject]{group}{\optional{group1, \moreargs}} +Returns one or more subgroups of the match. If there is a single +argument, the result is a single string; if there are +multiple arguments, the result is a tuple with one item per argument. +Without arguments, \var{group1} defaults to zero (the whole match +is returned). +If a \var{groupN} argument is zero, the corresponding return value is the +entire matching string; if it is in the inclusive range [1..99], it is +the string matching the corresponding parenthesized group. If a +group number is negative or larger than the number of groups defined +in the pattern, an \exception{IndexError} exception is raised. +If a group is contained in a part of the pattern that did not match, +the corresponding result is \code{None}. If a group is contained in a +part of the pattern that matched multiple times, the last match is +returned. 
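The rules above for \method{group()} can be sketched as follows. The pattern and the input strings are invented for illustration only.

```python
import re

# Illustrative pattern: two words and an optional trailing comma group.
m = re.match(r"(\w+) (\w+)(,)?", "Isaac Newton physicist")

whole = m.group()        # no arguments: same as m.group(0)
first = m.group(1)       # a single argument gives a single string
both = m.group(1, 2)     # several arguments give a tuple
missing = m.group(3)     # group 3 did not participate, so this is None

# A group that matches several times reports only its last match.
repeated = re.match(r"(..)+", "a1b2c3").group(1)

# A group number larger than the number of groups raises IndexError.
try:
    m.group(4)
    raised = False
except IndexError:
    raised = True

print(whole, first, both, missing, repeated, raised)
```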
+ +If the regular expression uses the \regexp{(?P<\var{name}>...)} syntax, +the \var{groupN} arguments may also be strings identifying groups by +their group name. If a string argument is not used as a group name in +the pattern, an \exception{IndexError} exception is raised. + +A moderately complicated example: + +\begin{verbatim} +m = re.match(r"(?P<int>\d+)\.(\d*)", '3.14') +\end{verbatim} + +After performing this match, \code{m.group(1)} is \code{'3'}, as is +\code{m.group('int')}, and \code{m.group(2)} is \code{'14'}. +\end{methoddesc} + +\begin{methoddesc}[MatchObject]{groups}{\optional{default}} +Return a tuple containing all the subgroups of the match, from 1 up to +however many groups are in the pattern. The \var{default} argument is +used for groups that did not participate in the match; it defaults to +\code{None}. (Incompatibility note: in the original Python 1.5 +release, if the tuple was one element long, a string would be returned +instead. In later versions (from 1.5.1 on), a singleton tuple is +returned in such cases.) +\end{methoddesc} + +\begin{methoddesc}[MatchObject]{groupdict}{\optional{default}} +Return a dictionary containing all the \emph{named} subgroups of the +match, keyed by the subgroup name. The \var{default} argument is +used for groups that did not participate in the match; it defaults to +\code{None}. +\end{methoddesc} + +\begin{methoddesc}[MatchObject]{start}{\optional{group}} +\methodline{end}{\optional{group}} +Return the indices of the start and end of the substring +matched by \var{group}; \var{group} defaults to zero (meaning the whole +matched substring). +Return \code{-1} if \var{group} exists but +did not contribute to the match. 
For a match object +\var{m}, and a group \var{g} that did contribute to the match, the +substring matched by group \var{g} (equivalent to +\code{\var{m}.group(\var{g})}) is + +\begin{verbatim} +m.string[m.start(g):m.end(g)] +\end{verbatim} + +Note that +\code{m.start(\var{group})} will equal \code{m.end(\var{group})} if +\var{group} matched a null string. For example, after \code{\var{m} = +re.search('b(c?)', 'cba')}, \code{\var{m}.start(0)} is 1, +\code{\var{m}.end(0)} is 2, \code{\var{m}.start(1)} and +\code{\var{m}.end(1)} are both 2, and \code{\var{m}.start(2)} raises +an \exception{IndexError} exception. +\end{methoddesc} + +\begin{methoddesc}[MatchObject]{span}{\optional{group}} +For \class{MatchObject} \var{m}, return the 2-tuple +\code{(\var{m}.start(\var{group}), \var{m}.end(\var{group}))}. +Note that if \var{group} did not contribute to the match, this is +\code{(-1, -1)}. Again, \var{group} defaults to zero. +\end{methoddesc} + +\begin{memberdesc}[MatchObject]{pos} +The value of \var{pos} which was passed to the \function{search()} or +\function{match()} method of the \class{RegexObject}. This is the +index into the string at which the RE engine started looking for a +match. +\end{memberdesc} + +\begin{memberdesc}[MatchObject]{endpos} +The value of \var{endpos} which was passed to the \function{search()} +or \function{match()} method of the \class{RegexObject}. This is the +index into the string beyond which the RE engine will not go. +\end{memberdesc} + +\begin{memberdesc}[MatchObject]{lastindex} +The integer index of the last matched capturing group, or \code{None} +if no group was matched at all. For example, the expressions +\regexp{(a)b}, \regexp{((a)(b))}, and \regexp{((ab))} will have +\code{lastindex == 1} if applied to the string \code{'ab'}, +while the expression \regexp{(a)(b)} will have \code{lastindex == 2}, +if applied to the same string. 
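As a small sketch of the offsets and the \member{lastindex} values described above, the following code reuses the patterns from the text. The variable names and the extra \code{(a)|(b)} pattern are illustrative assumptions.

```python
import re

# The example from the start()/end() description.
m = re.search("b(c?)", "cba")
span0 = m.span(0)                 # (1, 2)
span1 = m.span(1)                 # (2, 2): group 1 matched the empty string

# The matched substring can be recovered by slicing m.string.
sliced = m.string[m.start(0):m.end(0)]

# A group that exists but did not contribute to the match reports -1.
unmatched = re.match("(a)|(b)", "b").span(1)

# lastindex for the four patterns discussed above, applied to 'ab'.
patterns = ["(a)b", "((a)(b))", "((ab))", "(a)(b)"]
last_indexes = [re.match(p, "ab").lastindex for p in patterns]

print(span0, span1, sliced, unmatched, last_indexes)
```

In \regexp{((a)(b))} the outermost group is the last one to finish matching, which is why \code{lastindex} is 1 rather than 3.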
+\end{memberdesc} + +\begin{memberdesc}[MatchObject]{lastgroup} +The name of the last matched capturing group, or \code{None} if the +group didn't have a name, or if no group was matched at all. +\end{memberdesc} + +\begin{memberdesc}[MatchObject]{re} +The regular expression object whose \method{match()} or +\method{search()} method produced this \class{MatchObject} instance. +\end{memberdesc} + +\begin{memberdesc}[MatchObject]{string} +The string passed to \function{match()} or \function{search()}. +\end{memberdesc} + +\subsection{Examples} + +\leftline{\strong{Simulating \cfunction{scanf()}}} + +Python does not currently have an equivalent to \cfunction{scanf()}. +\ttindex{scanf()} +Regular expressions are generally more powerful, though also more +verbose, than \cfunction{scanf()} format strings. The table below +offers some more-or-less equivalent mappings between +\cfunction{scanf()} format tokens and regular expressions. + +\begin{tableii}{l|l}{textrm}{\cfunction{scanf()} Token}{Regular Expression} + \lineii{\code{\%c}} + {\regexp{.}} + \lineii{\code{\%5c}} + {\regexp{.\{5\}}} + \lineii{\code{\%d}} + {\regexp{[-+]?\e d+}} + \lineii{\code{\%e}, \code{\%E}, \code{\%f}, \code{\%g}} + {\regexp{[-+]?(\e d+(\e.\e d*)?|\e.\e d+)([eE][-+]?\e d+)?}} + \lineii{\code{\%i}} + {\regexp{[-+]?(0[xX][\e dA-Fa-f]+|0[0-7]*|\e d+)}} + \lineii{\code{\%o}} + {\regexp{0[0-7]*}} + \lineii{\code{\%s}} + {\regexp{\e S+}} + \lineii{\code{\%u}} + {\regexp{\e d+}} + \lineii{\code{\%x}, \code{\%X}} + {\regexp{0[xX][\e dA-Fa-f]+}} +\end{tableii} + +To extract the filename and numbers from a string like + +\begin{verbatim} + /usr/sbin/sendmail - 0 errors, 4 warnings +\end{verbatim} + +you would use a \cfunction{scanf()} format like + +\begin{verbatim} + %s - %d errors, %d warnings +\end{verbatim} + +The equivalent regular expression would be + +\begin{verbatim} + (\S+) - (\d+) errors, (\d+) warnings +\end{verbatim} + +\leftline{\strong{Avoiding recursion}} + +If you create regular 
expressions that require the engine to perform a +lot of recursion, you may encounter a \exception{RuntimeError} exception with +the message \code{maximum recursion limit exceeded}. For example, + +\begin{verbatim} +>>> import re +>>> s = 'Begin ' + 1000*'a very long string ' + 'end' +>>> re.match('Begin (\w| )*? end', s).end() +Traceback (most recent call last): + File "<stdin>", line 1, in ? + File "/usr/local/lib/python2.5/re.py", line 132, in match + return _compile(pattern, flags).match(string) +RuntimeError: maximum recursion limit exceeded +\end{verbatim} + +You can often restructure your regular expression to avoid recursion. + +Starting with Python 2.3, simple uses of the \regexp{*?} pattern are +special-cased to avoid recursion. Thus, the above regular expression +can avoid recursion by being recast as +\regexp{Begin [a-zA-Z0-9_ ]*?end}. As a further benefit, such regular +expressions will run faster than their recursive equivalents.
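The two examples in this section can be combined into one short sketch. The variable names are illustrative, and whether the original \regexp{(\e w| )*?} form actually hits the recursion limit depends on the Python version, so only the recast pattern is run here.

```python
import re

# Simulating scanf("%s - %d errors, %d warnings") with a regular expression.
line = "/usr/sbin/sendmail - 0 errors, 4 warnings"
m = re.match(r"(\S+) - (\d+) errors, (\d+) warnings", line)
filename = m.group(1)
# Unlike scanf(), the groups are strings and must be converted explicitly.
n_errors = int(m.group(2))
n_warnings = int(m.group(3))

# The recast, character-class form of the recursion example.
s = "Begin " + 1000 * "a very long string " + "end"
end = re.match(r"Begin [a-zA-Z0-9_ ]*?end", s).end()

print(filename, n_errors, n_warnings, end == len(s))
```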