More documentation
parent
8a9cfdb118
commit
29c9048c9e
24
Makefile.am
@ -36,6 +36,11 @@ dist_html_DATA = \
|
||||
doc/html/pcre2matching.html \
|
||||
doc/html/pcre2partial.html \
|
||||
doc/html/pcre2pattern.html \
|
||||
doc/html/pcre2perform.html \
|
||||
doc/html/pcre2posix.html \
|
||||
doc/html/pcre2sample.html \
|
||||
doc/html/pcre2stack.html \
|
||||
doc/html/pcre2syntax.html \
|
||||
doc/html/pcre2test.html \
|
||||
doc/html/pcre2unicode.html
|
||||
|
||||
@ -66,12 +71,7 @@ dist_html_DATA = \
|
||||
# doc/html/pcre2_utf16_to_host_byte_order.html \
|
||||
# doc/html/pcre2_utf32_to_host_byte_order.html \
|
||||
# doc/html/pcre2_version.html \
|
||||
# doc/html/pcre2perform.html \
|
||||
# doc/html/pcre2posix.html \
|
||||
# doc/html/pcre2precompile.html \
|
||||
# doc/html/pcre2sample.html \
|
||||
# doc/html/pcre2stack.html \
|
||||
# doc/html/pcre2syntax.html
|
||||
# doc/html/pcre2precompile.html
|
||||
|
||||
# FIXME
|
||||
dist_man_MANS = \
|
||||
@ -88,6 +88,11 @@ dist_man_MANS = \
|
||||
doc/pcre2matching.3 \
|
||||
doc/pcre2partial.3 \
|
||||
doc/pcre2pattern.3 \
|
||||
doc/pcre2perform.3 \
|
||||
doc/pcre2posix.3 \
|
||||
doc/pcre2sample.3 \
|
||||
doc/pcre2stack.3 \
|
||||
doc/pcre2syntax.3 \
|
||||
doc/pcre2test.1 \
|
||||
doc/pcre2unicode.3
|
||||
|
||||
@ -120,12 +125,7 @@ dist_man_MANS = \
|
||||
# doc/pcre2_utf16_to_host_byte_order.3 \
|
||||
# doc/pcre2_utf32_to_host_byte_order.3 \
|
||||
# doc/pcre2_version.3 \
|
||||
# doc/pcre2perform.3 \
|
||||
# doc/pcre2posix.3 \
|
||||
# doc/pcre2precompile.3 \
|
||||
# doc/pcre2sample.3 \
|
||||
# doc/pcre2stack.3 \
|
||||
# doc/pcre2syntax.3
|
||||
# doc/pcre2precompile.3
|
||||
|
||||
# The Libtool libraries to install. We'll add to this later.
|
||||
|
||||
|
196
doc/html/pcre2perform.html
Normal file
@ -0,0 +1,196 @@
|
||||
<html>
|
||||
<head>
|
||||
<title>pcre2perform specification</title>
|
||||
</head>
|
||||
<body bgcolor="#FFFFFF" text="#00005A" link="#0066FF" alink="#3399FF" vlink="#2222BB">
|
||||
<h1>pcre2perform man page</h1>
|
||||
<p>
|
||||
Return to the <a href="index.html">PCRE2 index page</a>.
|
||||
</p>
|
||||
<p>
|
||||
This page is part of the PCRE2 HTML documentation. It was generated
|
||||
automatically from the original man page. If there is any nonsense in it,
|
||||
please consult the man page, in case the conversion went wrong.
|
||||
<br>
|
||||
<br><b>
|
||||
PCRE2 PERFORMANCE
|
||||
</b><br>
|
||||
<P>
|
||||
Two aspects of performance are discussed below: memory usage and processing
|
||||
time. The way you express your pattern as a regular expression can affect both
|
||||
of them.
|
||||
</P>
|
||||
<br><b>
|
||||
COMPILED PATTERN MEMORY USAGE
|
||||
</b><br>
|
||||
<P>
|
||||
Patterns are compiled by PCRE2 into a reasonably efficient interpretive code,
|
||||
so that most simple patterns do not use much memory. However, there is one case
|
||||
where the memory usage of a compiled pattern can be unexpectedly large. If a
|
||||
parenthesized subpattern has a quantifier with a minimum greater than 1 and/or
|
||||
a limited maximum, the whole subpattern is repeated in the compiled code. For
|
||||
example, the pattern
|
||||
<pre>
|
||||
(abc|def){2,4}
|
||||
</pre>
|
||||
is compiled as if it were
|
||||
<pre>
|
||||
(abc|def)(abc|def)((abc|def)(abc|def)?)?
|
||||
</pre>
|
||||
(Technical aside: It is done this way so that backtrack points within each of
|
||||
the repetitions can be independently maintained.)
|
||||
</P>
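<P>
The expansion described above can be sketched outside PCRE2. The following is a
minimal illustration in Python, whose <b>re</b> module is used here only as a
stand-in backtracking engine: it is not PCRE2, and PCRE2's compiled code is not
a pattern string. The helper name <b>expand_bounded</b> is hypothetical. It
builds the expanded textual form of a parenthesized group with a {min,max}
quantifier and checks that both forms accept the same repetition counts.
</P>

```python
import re


def expand_bounded(group, minimum, maximum):
    """Return the textual expansion of group{minimum,maximum}.

    Illustration only. `group` must already be parenthesized. Mandatory
    copies are written out in full and the optional copies are nested,
    as in the example in the text above.
    """
    if not 0 <= minimum <= maximum:
        raise ValueError("need 0 <= minimum <= maximum")
    optional = ""
    for index in range(maximum - minimum):
        # The innermost optional copy is just "group?"; each further
        # level wraps another copy around what has been built so far.
        optional = group + "?" if index == 0 else "(" + group + optional + ")?"
    return group * minimum + optional


if __name__ == "__main__":
    expanded = expand_bounded("(abc|def)", 2, 4)
    print(expanded)
    for count in range(6):
        subject = "abc" * count
        original = re.fullmatch(r"(abc|def){2,4}", subject) is not None
        rewritten = re.fullmatch(expanded, subject) is not None
        print(count, original, rewritten)
```

<P>
The number of copies grows with the maximum, and nested bounds multiply: in a
textual expansion of ((ab){1,1000}c){1,3} the inner group appears 3 x 1000
times.
</P>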
|
||||
<P>
|
||||
For regular expressions whose quantifiers use only small numbers, this is not
|
||||
usually a problem. However, if the numbers are large, and particularly if such
|
||||
repetitions are nested, the memory usage can become an embarrassment. For
|
||||
example, the very simple pattern
|
||||
<pre>
|
||||
((ab){1,1000}c){1,3}
|
||||
</pre>
|
||||
uses 51K bytes when compiled using the 8-bit library. When PCRE2 is compiled
|
||||
with its default internal pointer size of two bytes, the size limit on a
|
||||
compiled pattern is 64K code units in the 8-bit and 16-bit libraries, and this
|
||||
is reached with the above pattern if the outer repetition is increased from 3
|
||||
to 4. PCRE2 can be compiled to use larger internal pointers and thus handle
|
||||
larger compiled patterns, but it is better to try to rewrite your pattern to
|
||||
use less memory if you can.
|
||||
</P>
|
||||
<P>
|
||||
One way of reducing the memory usage for such patterns is to make use of
|
||||
PCRE2's
|
||||
<a href="pcre2pattern.html#subpatternsassubroutines">"subroutine"</a>
|
||||
facility. Re-writing the above pattern as
|
||||
<pre>
|
||||
((ab)(?2){0,999}c)(?1){0,2}
|
||||
</pre>
|
||||
reduces the memory requirements to 18K, and indeed it remains under 20K even
|
||||
with the outer repetition increased to 100. However, this pattern is not
|
||||
exactly equivalent, because the "subroutine" calls are treated as
|
||||
<a href="pcre2pattern.html#atomicgroup">atomic groups</a>
|
||||
into which there can be no backtracking if there is a subsequent matching
|
||||
failure. Therefore, PCRE2 cannot do this kind of rewriting automatically.
|
||||
Furthermore, there is a noticeable loss of speed when executing the modified
|
||||
pattern. Nevertheless, if the atomic grouping is not a problem and the loss of
|
||||
speed is acceptable, this kind of rewriting will allow you to process patterns
|
||||
that PCRE2 cannot otherwise handle.
|
||||
</P>
|
||||
<br><b>
|
||||
STACK USAGE AT RUN TIME
|
||||
</b><br>
|
||||
<P>
|
||||
When <b>pcre2_match()</b> is used for matching, certain kinds of pattern can
|
||||
cause it to use large amounts of the process stack. In some environments the
|
||||
default process stack is quite small, and if it runs out the result is often
|
||||
SIGSEGV. Rewriting your pattern can often help. The
|
||||
<a href="pcre2stack.html"><b>pcre2stack</b></a>
|
||||
documentation discusses this issue in detail.
|
||||
</P>
|
||||
<br><b>
|
||||
PROCESSING TIME
|
||||
</b><br>
|
||||
<P>
|
||||
Certain items in regular expression patterns are processed more efficiently
|
||||
than others. It is more efficient to use a character class like [aeiou] than a
|
||||
set of single-character alternatives such as (a|e|i|o|u). In general, the
|
||||
simplest construction that provides the required behaviour is usually the most
|
||||
efficient. Jeffrey Friedl's book "Mastering Regular Expressions" contains useful general discussion
|
||||
about optimizing regular expressions for efficient performance. This document
|
||||
contains a few observations about PCRE2.
|
||||
</P>
|
||||
<P>
|
||||
Using Unicode character properties (the \p, \P, and \X escapes) is slow,
|
||||
because PCRE2 has to use a multi-stage table lookup whenever it needs a
|
||||
character's property. If you can find an alternative pattern that does not use
|
||||
character properties, it will probably be faster.
|
||||
</P>
|
||||
<P>
|
||||
By default, the escape sequences \b, \d, \s, and \w, and the POSIX
|
||||
character classes such as [:alpha:] do not use Unicode properties, partly for
|
||||
backwards compatibility, and partly for performance reasons. However, you can
|
||||
set the PCRE2_UCP option or start the pattern with (*UCP) if you want Unicode
|
||||
character properties to be used. This can double the matching time for items
|
||||
such as \d, when matched with <b>pcre2_match()</b>; the performance loss is
|
||||
less with a DFA matching function, and in both cases there is not much
|
||||
difference for \b.
|
||||
</P>
|
||||
<P>
|
||||
When a pattern begins with .* not in parentheses, or in parentheses that are
|
||||
not the subject of a backreference, and the PCRE2_DOTALL option is set, the
|
||||
pattern is implicitly anchored by PCRE2, since it can match only at the start
|
||||
of a subject string. However, if PCRE2_DOTALL is not set, PCRE2 cannot make
|
||||
this optimization, because the dot metacharacter does not then match a newline,
|
||||
and if the subject string contains newlines, the pattern may match from the
|
||||
character immediately following one of them instead of from the very start. For
|
||||
example, the pattern
|
||||
<pre>
|
||||
.*second
|
||||
</pre>
|
||||
matches the subject "first\nand second" (where \n stands for a newline
|
||||
character), with the match starting at the seventh character. In order to do
|
||||
this, PCRE2 has to retry the match starting after every newline in the subject.
|
||||
</P>
|
||||
<P>
|
||||
If you are using such a pattern with subject strings that do not contain
|
||||
newlines, the best performance is obtained by setting PCRE2_DOTALL, or starting
|
||||
the pattern with ^.* or ^.*? to indicate explicit anchoring. That saves PCRE2
|
||||
from having to scan along the subject looking for a newline to restart at.
|
||||
</P>
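<P>
The effect of a leading .* on where a match can start is easy to observe. The
sketch below uses Python's <b>re</b> module as a stand-in (an assumption: its
dot, DOTALL, and ^ semantics follow the same Perl conventions described here,
but it is not PCRE2 and says nothing about PCRE2's internal anchoring
optimization). The helper name <b>match_start</b> is hypothetical.
</P>

```python
import re

SUBJECT = "first\nand second"


def match_start(pattern, subject, flags=0):
    """Return the offset where the first match starts, or None."""
    found = re.search(pattern, subject, flags)
    return None if found is None else found.start()


if __name__ == "__main__":
    # Without DOTALL the dot stops at the newline, so the match can only
    # begin after it: offset 6, the seventh character.
    print(match_start(r".*second", SUBJECT))
    # With DOTALL the leading .* can consume the newline, so the match
    # begins at offset 0 and no later start position needs to be tried.
    print(match_start(r".*second", SUBJECT, re.DOTALL))
    # An explicit ^ anchor (no MULTILINE) rules out a start after the
    # newline, so this subject does not match at all.
    print(match_start(r"^.*second", SUBJECT))
```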
|
||||
<P>
|
||||
Beware of patterns that contain nested indefinite repeats. These can take a
|
||||
long time to run when applied to a string that does not match. Consider the
|
||||
pattern fragment
|
||||
<pre>
|
||||
^(a+)*
|
||||
</pre>
|
||||
This can match "aaaa" in 16 different ways, and this number increases very
|
||||
rapidly as the string gets longer. (The * repeat can match 0, 1, 2, 3, or 4
|
||||
times, and for each of those cases other than 0 or 4, the + repeats can match
|
||||
different numbers of times.) When the remainder of the pattern is such that the
|
||||
entire match is going to fail, PCRE2 has in principle to try every possible
|
||||
variation, and this can take an extremely long time, even for relatively short
|
||||
strings.
|
||||
</P>
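<P>
One way to arrive at the figure of 16 is to count, for every prefix of the
subject (including the empty one), the ordered ways of cutting it into
non-empty runs of "a". The following Python sketch does that counting. It
assumes this is the intended interpretation, and the function names are
hypothetical. It does not run any regular expression engine.
</P>

```python
from functools import lru_cache


@lru_cache(maxsize=None)
def ordered_splits(length):
    """Number of ways to cut `length` characters into non-empty runs."""
    if length == 0:
        return 1  # zero iterations of the group
    return sum(ordered_splits(length - first) for first in range(1, length + 1))


def ways_to_match(subject_length):
    """Count the ways (a+)* can match some prefix of a run of "a"s."""
    return sum(ordered_splits(prefix) for prefix in range(subject_length + 1))


if __name__ == "__main__":
    for n in (4, 10, 20):
        print(n, ways_to_match(n))
```

<P>
Under this counting the total doubles with each extra character, which is why
a failing match can become very slow even for fairly short subjects.
</P>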
|
||||
<P>
|
||||
An optimization catches some of the simpler cases such as
|
||||
<pre>
|
||||
(a+)*b
|
||||
</pre>
|
||||
where a literal character follows. Before embarking on the standard matching
|
||||
procedure, PCRE2 checks that there is a "b" later in the subject string, and if
|
||||
there is not, it fails the match immediately. However, when there is no
|
||||
following literal this optimization cannot be used. You can see the difference
|
||||
by comparing the behaviour of
|
||||
<pre>
|
||||
(a+)*\d
|
||||
</pre>
|
||||
with the pattern above. The pattern that ends with the literal "b" fails almost
instantly when applied to a whole line of "a" characters, whereas the one that
ends with \d takes an appreciable time with strings longer than about 20 characters.
|
||||
</P>
|
||||
<P>
|
||||
In many cases, the solution to this kind of performance issue is to use an
|
||||
atomic group or a possessive quantifier.
|
||||
</P>
|
||||
<br><b>
|
||||
AUTHOR
|
||||
</b><br>
|
||||
<P>
|
||||
Philip Hazel
|
||||
<br>
|
||||
University Computing Service
|
||||
<br>
|
||||
Cambridge CB2 3QH, England.
|
||||
<br>
|
||||
</P>
|
||||
<br><b>
|
||||
REVISION
|
||||
</b><br>
|
||||
<P>
|
||||
Last updated: 20 October 2014
|
||||
<br>
|
||||
Copyright © 1997-2014 University of Cambridge.
|
||||
<br>
|
||||
<p>
|
||||
Return to the <a href="index.html">PCRE2 index page</a>.
|
||||
</p>
|
292
doc/html/pcre2posix.html
Normal file
@ -0,0 +1,292 @@
|
||||
<html>
|
||||
<head>
|
||||
<title>pcre2posix specification</title>
|
||||
</head>
|
||||
<body bgcolor="#FFFFFF" text="#00005A" link="#0066FF" alink="#3399FF" vlink="#2222BB">
|
||||
<h1>pcre2posix man page</h1>
|
||||
<p>
|
||||
Return to the <a href="index.html">PCRE2 index page</a>.
|
||||
</p>
|
||||
<p>
|
||||
This page is part of the PCRE2 HTML documentation. It was generated
|
||||
automatically from the original man page. If there is any nonsense in it,
|
||||
please consult the man page, in case the conversion went wrong.
|
||||
<br>
|
||||
<ul>
|
||||
<li><a name="TOC1" href="#SEC1">SYNOPSIS</a>
|
||||
<li><a name="TOC2" href="#SEC2">DESCRIPTION</a>
|
||||
<li><a name="TOC3" href="#SEC3">COMPILING A PATTERN</a>
|
||||
<li><a name="TOC4" href="#SEC4">MATCHING NEWLINE CHARACTERS</a>
|
||||
<li><a name="TOC5" href="#SEC5">MATCHING A PATTERN</a>
|
||||
<li><a name="TOC6" href="#SEC6">ERROR MESSAGES</a>
|
||||
<li><a name="TOC7" href="#SEC7">MEMORY USAGE</a>
|
||||
<li><a name="TOC8" href="#SEC8">AUTHOR</a>
|
||||
<li><a name="TOC9" href="#SEC9">REVISION</a>
|
||||
</ul>
|
||||
<br><a name="SEC1" href="#TOC1">SYNOPSIS</a><br>
|
||||
<P>
|
||||
<b>#include <pcre2posix.h></b>
|
||||
</P>
|
||||
<P>
|
||||
<b>int regcomp(regex_t *<i>preg</i>, const char *<i>pattern</i>,</b>
|
||||
<b> int <i>cflags</i>);</b>
|
||||
<br>
|
||||
<br>
|
||||
<b>int regexec(const regex_t *<i>preg</i>, const char *<i>string</i>,</b>
|
||||
<b> size_t <i>nmatch</i>, regmatch_t <i>pmatch</i>[], int <i>eflags</i>);</b>
|
||||
<br>
|
||||
<br>
|
||||
<b>size_t regerror(int <i>errcode</i>, const regex_t *<i>preg</i>,</b>
|
||||
<b> char *<i>errbuf</i>, size_t <i>errbuf_size</i>);</b>
|
||||
<br>
|
||||
<br>
|
||||
<b>void regfree(regex_t *<i>preg</i>);</b>
|
||||
</P>
|
||||
<br><a name="SEC2" href="#TOC1">DESCRIPTION</a><br>
|
||||
<P>
|
||||
This set of functions provides a POSIX-style API for the PCRE2 regular
|
||||
expression 8-bit library. See the
|
||||
<a href="pcre2api.html"><b>pcre2api</b></a>
|
||||
documentation for a description of PCRE2's native API, which contains much
|
||||
additional functionality. There is no POSIX-style wrapper for PCRE2's 16-bit
|
||||
and 32-bit libraries.
|
||||
</P>
|
||||
<P>
|
||||
The functions described here are just wrapper functions that ultimately call
|
||||
the PCRE2 native API. Their prototypes are defined in the <b>pcre2posix.h</b>
|
||||
header file, and on Unix systems the library itself is called
|
||||
<b>libpcre2-posix.a</b>, so can be accessed by adding <b>-lpcre2-posix</b> to the
|
||||
command for linking an application that uses them. Because the POSIX functions
|
||||
call the native ones, it is also necessary to add <b>-lpcre2-8</b>.
|
||||
</P>
|
||||
<P>
|
||||
Those POSIX option bits that can reasonably be mapped to PCRE2 native options
|
||||
have been implemented. In addition, the option REG_EXTENDED is defined with the
|
||||
value zero. This has no effect, but since programs that are written to the
|
||||
POSIX interface often use it, this makes it easier to slot in PCRE2 as a
|
||||
replacement library. Other POSIX options are not even defined.
|
||||
</P>
|
||||
<P>
|
||||
There are also some other options that are not defined by POSIX. These have
|
||||
been added at the request of users who want to make use of certain
|
||||
PCRE2-specific features via the POSIX calling interface.
|
||||
</P>
|
||||
<P>
|
||||
When PCRE2 is called via these functions, it is only the API that is POSIX-like
|
||||
in style. The syntax and semantics of the regular expressions themselves are
|
||||
still those of Perl, subject to the setting of various PCRE2 options, as
|
||||
described below. "POSIX-like in style" means that the API approximates to the
|
||||
POSIX definition; it is not fully POSIX-compatible, and in multi-unit encoding
|
||||
domains it is probably even less compatible.
|
||||
</P>
|
||||
<P>
|
||||
The header for these functions is supplied as <b>pcre2posix.h</b> to avoid any
|
||||
potential clash with other POSIX libraries. It can, of course, be renamed or
|
||||
aliased as <b>regex.h</b>, which is the "correct" name. It provides two
|
||||
structure types, <i>regex_t</i> for compiled internal forms, and
|
||||
<i>regmatch_t</i> for returning captured substrings. It also defines some
|
||||
constants whose names start with "REG_"; these are used for setting options and
|
||||
identifying error codes.
|
||||
</P>
|
||||
<br><a name="SEC3" href="#TOC1">COMPILING A PATTERN</a><br>
|
||||
<P>
|
||||
The function <b>regcomp()</b> is called to compile a pattern into an
|
||||
internal form. The pattern is a C string terminated by a binary zero, and
|
||||
is passed in the argument <i>pattern</i>. The <i>preg</i> argument is a pointer
|
||||
to a <b>regex_t</b> structure that is used as a base for storing information
|
||||
about the compiled regular expression.
|
||||
</P>
|
||||
<P>
|
||||
The argument <i>cflags</i> is either zero, or contains one or more of the bits
|
||||
defined by the following macros:
|
||||
<pre>
|
||||
REG_DOTALL
|
||||
</pre>
|
||||
The PCRE2_DOTALL option is set when the regular expression is passed for
|
||||
compilation to the native function. Note that REG_DOTALL is not part of the
|
||||
POSIX standard.
|
||||
<pre>
|
||||
REG_ICASE
|
||||
</pre>
|
||||
The PCRE2_CASELESS option is set when the regular expression is passed for
|
||||
compilation to the native function.
|
||||
<pre>
|
||||
REG_NEWLINE
|
||||
</pre>
|
||||
The PCRE2_MULTILINE option is set when the regular expression is passed for
|
||||
compilation to the native function. Note that this does <i>not</i> mimic the
|
||||
defined POSIX behaviour for REG_NEWLINE (see the following section).
|
||||
<pre>
|
||||
REG_NOSUB
|
||||
</pre>
|
||||
The PCRE2_NO_AUTO_CAPTURE option is set when the regular expression is passed
|
||||
for compilation to the native function. In addition, when a pattern that is
|
||||
compiled with this flag is passed to <b>regexec()</b> for matching, the
|
||||
<i>nmatch</i> and <i>pmatch</i> arguments are ignored, and no captured strings
|
||||
are returned.
|
||||
<pre>
|
||||
REG_UCP
|
||||
</pre>
|
||||
The PCRE2_UCP option is set when the regular expression is passed for
|
||||
compilation to the native function. This causes PCRE2 to use Unicode properties
|
||||
when matching \d, \w, etc., instead of just recognizing ASCII values. Note
|
||||
that REG_UCP is not part of the POSIX standard.
|
||||
<pre>
|
||||
REG_UNGREEDY
|
||||
</pre>
|
||||
The PCRE2_UNGREEDY option is set when the regular expression is passed for
|
||||
compilation to the native function. Note that REG_UNGREEDY is not part of the
|
||||
POSIX standard.
|
||||
<pre>
|
||||
REG_UTF
|
||||
</pre>
|
||||
The PCRE2_UTF option is set when the regular expression is passed for
|
||||
compilation to the native function. This causes the pattern itself and all data
|
||||
strings used for matching it to be treated as UTF-8 strings. Note that REG_UTF
|
||||
is not part of the POSIX standard.
|
||||
</P>
|
||||
<P>
|
||||
In the absence of these flags, no options are passed to the native function.
|
||||
This means that the regex is compiled with PCRE2 default semantics. In
|
||||
particular, the way it handles newline characters in the subject string is the
|
||||
Perl way, not the POSIX way. Note that setting PCRE2_MULTILINE has only
|
||||
<i>some</i> of the effects specified for REG_NEWLINE. It does not affect the way
|
||||
newlines are matched by the dot metacharacter (they are not) or by a negative
|
||||
class such as [^a] (they are).
|
||||
</P>
|
||||
<P>
|
||||
The yield of <b>regcomp()</b> is zero on success, and non-zero otherwise. The
|
||||
<i>preg</i> structure is filled in on success, and one member of the structure
|
||||
is public: <i>re_nsub</i> contains the number of capturing subpatterns in
|
||||
the regular expression. Various error codes are defined in the header file.
|
||||
</P>
|
||||
<P>
|
||||
NOTE: If the yield of <b>regcomp()</b> is non-zero, you must not attempt to
|
||||
use the contents of the <i>preg</i> structure. If, for example, you pass it to
|
||||
<b>regexec()</b>, the result is undefined and your program is likely to crash.
|
||||
</P>
|
||||
<br><a name="SEC4" href="#TOC1">MATCHING NEWLINE CHARACTERS</a><br>
|
||||
<P>
|
||||
This area is not simple, because POSIX and Perl take different views of things.
|
||||
It is not possible to get PCRE2 to obey POSIX semantics, but then PCRE2 was
|
||||
never intended to be a POSIX engine. The following table lists the different
|
||||
possibilities for matching newline characters in PCRE2:
|
||||
<pre>
|
||||
Default Change with
|
||||
|
||||
. matches newline no PCRE2_DOTALL
|
||||
newline matches [^a] yes not changeable
|
||||
$ matches \n at end yes PCRE2_DOLLAR_ENDONLY
|
||||
$ matches \n in middle no PCRE2_MULTILINE
|
||||
^ matches \n in middle no PCRE2_MULTILINE
|
||||
</pre>
|
||||
This is the equivalent table for POSIX:
|
||||
<pre>
|
||||
Default Change with
|
||||
|
||||
. matches newline yes REG_NEWLINE
|
||||
newline matches [^a] yes REG_NEWLINE
|
||||
$ matches \n at end no REG_NEWLINE
|
||||
$ matches \n in middle no REG_NEWLINE
|
||||
^ matches \n in middle no REG_NEWLINE
|
||||
</pre>
|
||||
PCRE2's behaviour is the same as Perl's, except that there is no equivalent for
|
||||
PCRE2_DOLLAR_ENDONLY in Perl. In both PCRE2 and Perl, there is no way to stop
|
||||
newline from matching [^a].
|
||||
</P>
|
||||
<P>
|
||||
The default POSIX newline handling can be obtained by setting PCRE2_DOTALL and
|
||||
PCRE2_DOLLAR_ENDONLY, but there is no way to make PCRE2 behave exactly as for
|
||||
the REG_NEWLINE action.
|
||||
</P>
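<P>
The Perl-style defaults in the first table can be checked with a few one-line
probes. The sketch below uses Python's <b>re</b> module as a stand-in, on the
assumption that its defaults for these five cases follow the same Perl
conventions. It is not PCRE2 and it does not call the POSIX wrapper. Python has
no flag corresponding to PCRE2_DOLLAR_ENDONLY, so that change is not exercised.
The function and key names are hypothetical.
</P>

```python
import re


def newline_behaviour(flags=0):
    """Answer the five questions from the table for the given flags."""
    return {
        "dot_matches_newline":
            re.fullmatch(r".", "\n", flags) is not None,
        "negated_class_matches_newline":
            re.fullmatch(r"[^a]", "\n", flags) is not None,
        "dollar_before_final_newline":
            re.search(r"b$", "b\n", flags) is not None,
        "dollar_before_middle_newline":
            re.search(r"a$", "a\nb", flags) is not None,
        "caret_after_middle_newline":
            re.search(r"^b", "a\nb", flags) is not None,
    }


if __name__ == "__main__":
    for name, flags in (("default", 0),
                        ("DOTALL", re.DOTALL),
                        ("MULTILINE", re.MULTILINE)):
        print(name, newline_behaviour(flags))
```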
|
||||
<br><a name="SEC5" href="#TOC1">MATCHING A PATTERN</a><br>
|
||||
<P>
|
||||
The function <b>regexec()</b> is called to match a compiled pattern <i>preg</i>
|
||||
against a given <i>string</i>, which is by default terminated by a zero byte
|
||||
(but see REG_STARTEND below), subject to the options in <i>eflags</i>. These can
|
||||
be:
|
||||
<pre>
|
||||
REG_NOTBOL
|
||||
</pre>
|
||||
The PCRE2_NOTBOL option is set when calling the underlying PCRE2 matching
|
||||
function.
|
||||
<pre>
|
||||
REG_NOTEMPTY
|
||||
</pre>
|
||||
The PCRE2_NOTEMPTY option is set when calling the underlying PCRE2 matching
|
||||
function. Note that REG_NOTEMPTY is not part of the POSIX standard. However,
|
||||
setting this option can give more POSIX-like behaviour in some situations.
|
||||
<pre>
|
||||
REG_NOTEOL
|
||||
</pre>
|
||||
The PCRE2_NOTEOL option is set when calling the underlying PCRE2 matching
|
||||
function.
|
||||
<pre>
|
||||
REG_STARTEND
|
||||
</pre>
|
||||
The string is considered to start at <i>string</i> + <i>pmatch[0].rm_so</i> and
|
||||
to have a terminating NUL located at <i>string</i> + <i>pmatch[0].rm_eo</i>
|
||||
(there need not actually be a NUL at that location), regardless of the value of
|
||||
<i>nmatch</i>. This is a BSD extension, compatible with but not specified by
|
||||
IEEE Standard 1003.2 (POSIX.2), and should be used with caution in software
|
||||
intended to be portable to other systems. Note that a non-zero <i>rm_so</i> does
|
||||
not imply REG_NOTBOL; REG_STARTEND affects only the location of the string, not
|
||||
how it is matched.
|
||||
</P>
|
||||
<P>
|
||||
If the pattern was compiled with the REG_NOSUB flag, no data about any matched
|
||||
strings is returned. The <i>nmatch</i> and <i>pmatch</i> arguments of
|
||||
<b>regexec()</b> are ignored.
|
||||
</P>
|
||||
<P>
|
||||
If the value of <i>nmatch</i> is zero, or if the value of <i>pmatch</i> is NULL,
|
||||
no data about any matched strings is returned.
|
||||
</P>
|
||||
<P>
|
||||
Otherwise, the portion of the string that was matched, and also any captured
|
||||
substrings, are returned via the <i>pmatch</i> argument, which points to an
|
||||
array of <i>nmatch</i> structures of type <i>regmatch_t</i>, containing the
|
||||
members <i>rm_so</i> and <i>rm_eo</i>. These contain the byte offset to the first
|
||||
character of each substring and the offset to the first character after the end
|
||||
of each substring, respectively. The 0th element of the vector relates to the
|
||||
entire portion of <i>string</i> that was matched; subsequent elements relate to
|
||||
the capturing subpatterns of the regular expression. Unused entries in the
|
||||
array have both structure members set to -1.
|
||||
</P>
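<P>
The offset convention described above (start offset, offset one past the end,
and -1 for unused entries) can be illustrated without the C API. The sketch
below is a Python analogue, not a call to <b>regexec()</b>: it relies on the
fact that a Python match object reports (-1, -1) for a group that took no part
in the match. Byte strings are used so that the offsets are byte offsets. The
function name <b>posix_style_offsets</b> is hypothetical.
</P>

```python
import re


def posix_style_offsets(pattern, subject):
    """Return [(rm_so, rm_eo), ...] for group 0 and every capture group.

    `pattern` and `subject` are bytes, so the offsets are byte offsets.
    Returns None when there is no match.
    """
    found = re.search(pattern, subject)
    if found is None:
        return None
    # Match.span(i) yields (-1, -1) for a group that took no part in the match.
    return [found.span(index) for index in range(found.re.groups + 1)]


if __name__ == "__main__":
    print(posix_style_offsets(rb"(cat)|(dog)", b"the dog sat on the cat"))
```

<P>
Here element 0 covers the whole match, the first capturing group is unused,
and the second coincides with the whole match.
</P>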
|
||||
<P>
|
||||
A successful match yields a zero return; various error codes are defined in the
|
||||
header file, of which REG_NOMATCH is the "expected" failure code.
|
||||
</P>
|
||||
<br><a name="SEC6" href="#TOC1">ERROR MESSAGES</a><br>
|
||||
<P>
|
||||
The <b>regerror()</b> function maps a non-zero error code from either
|
||||
<b>regcomp()</b> or <b>regexec()</b> to a printable message. If <i>preg</i> is not
|
||||
NULL, the error should have arisen from the use of that structure. A message
|
||||
terminated by a binary zero is placed in <i>errbuf</i>. The length of the
|
||||
message, including the zero, is limited to <i>errbuf_size</i>. The yield of the
|
||||
function is the size of buffer needed to hold the whole message.
|
||||
</P>
|
||||
<br><a name="SEC7" href="#TOC1">MEMORY USAGE</a><br>
|
||||
<P>
|
||||
Compiling a regular expression causes memory to be allocated and associated
|
||||
with the <i>preg</i> structure. The function <b>regfree()</b> frees all such
|
||||
memory, after which <i>preg</i> may no longer be used as a compiled expression.
|
||||
</P>
|
||||
<br><a name="SEC8" href="#TOC1">AUTHOR</a><br>
|
||||
<P>
|
||||
Philip Hazel
|
||||
<br>
|
||||
University Computing Service
|
||||
<br>
|
||||
Cambridge CB2 3QH, England.
|
||||
<br>
|
||||
</P>
|
||||
<br><a name="SEC9" href="#TOC1">REVISION</a><br>
|
||||
<P>
|
||||
Last updated: 20 October 2014
|
||||
<br>
|
||||
Copyright © 1997-2014 University of Cambridge.
|
||||
<br>
|
||||
<p>
|
||||
Return to the <a href="index.html">PCRE2 index page</a>.
|
||||
</p>
|
106
doc/html/pcre2sample.html
Normal file
@ -0,0 +1,106 @@
|
||||
<html>
|
||||
<head>
|
||||
<title>pcre2sample specification</title>
|
||||
</head>
|
||||
<body bgcolor="#FFFFFF" text="#00005A" link="#0066FF" alink="#3399FF" vlink="#2222BB">
|
||||
<h1>pcre2sample man page</h1>
|
||||
<p>
|
||||
Return to the <a href="index.html">PCRE2 index page</a>.
|
||||
</p>
|
||||
<p>
|
||||
This page is part of the PCRE2 HTML documentation. It was generated
|
||||
automatically from the original man page. If there is any nonsense in it,
|
||||
please consult the man page, in case the conversion went wrong.
|
||||
<br>
|
||||
<br><b>
|
||||
PCRE2 SAMPLE PROGRAM
|
||||
</b><br>
|
||||
<P>
|
||||
A simple, complete demonstration program to get you started with using PCRE2 is
|
||||
supplied in the file <i>pcre2demo.c</i> in the <b>src</b> directory in the PCRE2
|
||||
distribution. A listing of this program is given in the
|
||||
<a href="pcre2demo.html"><b>pcre2demo</b></a>
|
||||
documentation. If you do not have a copy of the PCRE2 distribution, you can
|
||||
save this listing to re-create the contents of <i>pcre2demo.c</i>.
|
||||
</P>
|
||||
<P>
|
||||
The demonstration program, which uses the PCRE2 8-bit library, compiles the
|
||||
regular expression that is its first argument, and matches it against the
|
||||
subject string in its second argument. No PCRE2 options are set, and default
|
||||
character tables are used. If matching succeeds, the program outputs the
|
||||
portion of the subject that matched, together with the contents of any captured
|
||||
substrings.
|
||||
</P>
|
||||
<P>
|
||||
If the -g option is given on the command line, the program then goes on to
|
||||
check for further matches of the same regular expression in the same subject
|
||||
string. The logic is a little bit tricky because of the possibility of matching
|
||||
an empty string. Comments in the code explain what is going on.
|
||||
</P>
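<P>
The difficulty with empty matches can be seen in a simplified "find all
matches" loop. The Python sketch below is not a transcription of
<i>pcre2demo.c</i>, whose logic is more careful. It only shows, under that
stated simplification, why the search position must be moved on by hand after
an empty match. The function name <b>find_all</b> is hypothetical, and Python's
<b>re</b> module stands in for PCRE2.
</P>

```python
import re


def find_all(pattern, subject):
    """Return (start, end) offsets of successive matches in `subject`."""
    regex = re.compile(pattern)
    results = []
    position = 0
    while position <= len(subject):
        found = regex.search(subject, position)
        if found is None:
            break
        results.append((found.start(), found.end()))
        if found.end() == found.start():
            # An empty match does not advance the position by itself, so
            # step over one character to avoid finding it again forever.
            position = found.end() + 1
        else:
            position = found.end()
    return results


if __name__ == "__main__":
    print(find_all(r"cat|dog", "the dog sat on the cat"))
    print(find_all(r"a*", "baa"))
```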
|
||||
<P>
|
||||
If PCRE2 is installed in the standard include and library directories for your
|
||||
operating system, you should be able to compile the demonstration program using
|
||||
this command:
|
||||
<pre>
|
||||
gcc -o pcre2demo pcre2demo.c -lpcre2-8
|
||||
</pre>
|
||||
If PCRE2 is installed elsewhere, you may need to add additional options to the
|
||||
command line. For example, on a Unix-like system that has PCRE2 installed in
|
||||
<i>/usr/local</i>, you can compile the demonstration program using a command
|
||||
like this:
|
||||
<pre>
|
||||
gcc -o pcre2demo -I/usr/local/include pcre2demo.c -L/usr/local/lib -lpcre2-8
|
||||
|
||||
</PRE>
|
||||
</P>
|
||||
<P>
|
||||
Once you have compiled and linked the demonstration program, you can run simple
|
||||
tests like this:
|
||||
<pre>
|
||||
./pcre2demo 'cat|dog' 'the cat sat on the mat'
|
||||
./pcre2demo -g 'cat|dog' 'the dog sat on the cat'
|
||||
</pre>
|
||||
Note that there is a much more comprehensive test program, called
|
||||
<a href="pcre2test.html"><b>pcre2test</b>,</a>
|
||||
which supports many more facilities for testing regular expressions using the
|
||||
PCRE2 libraries. The
|
||||
<a href="pcre2demo.html"><b>pcre2demo</b></a>
|
||||
program is provided as a simple coding example.
|
||||
</P>
|
||||
<P>
|
||||
If you try to run
|
||||
<a href="pcre2demo.html"><b>pcre2demo</b></a>
|
||||
when PCRE2 is not installed in the standard library directory, you may get an
|
||||
error like this on some operating systems (e.g. Solaris):
|
||||
<pre>
|
||||
ld.so.1: a.out: fatal: libpcre2.so.0: open failed: No such file or directory
|
||||
</pre>
|
||||
This is caused by the way shared library support works on those systems. You
|
||||
need to add
|
||||
<pre>
|
||||
-R/usr/local/lib
|
||||
</pre>
|
||||
(for example) to the compile command to get round this problem.
|
||||
</P>
|
||||
<br><b>
|
||||
AUTHOR
|
||||
</b><br>
|
||||
<P>
|
||||
Philip Hazel
|
||||
<br>
|
||||
University Computing Service
|
||||
<br>
|
||||
Cambridge CB2 3QH, England.
|
||||
<br>
|
||||
</P>
|
||||
<br><b>
|
||||
REVISION
|
||||
</b><br>
|
||||
<P>
|
||||
Last updated: 20 October 2014
|
||||
<br>
|
||||
Copyright © 1997-2014 University of Cambridge.
|
||||
<br>
|
||||
<p>
|
||||
Return to the <a href="index.html">PCRE2 index page</a>.
|
||||
</p>
|
203
doc/html/pcre2stack.html
Normal file
@ -0,0 +1,203 @@
|
||||
<html>
|
||||
<head>
|
||||
<title>pcre2stack specification</title>
|
||||
</head>
|
||||
<body bgcolor="#FFFFFF" text="#00005A" link="#0066FF" alink="#3399FF" vlink="#2222BB">
|
||||
<h1>pcre2stack man page</h1>
|
||||
<p>
|
||||
Return to the <a href="index.html">PCRE2 index page</a>.
|
||||
</p>
|
||||
<p>
|
||||
This page is part of the PCRE2 HTML documentation. It was generated
|
||||
automatically from the original man page. If there is any nonsense in it,
|
||||
please consult the man page, in case the conversion went wrong.
|
||||
<br>
|
||||
<br><b>
|
||||
PCRE2 DISCUSSION OF STACK USAGE
|
||||
</b><br>
|
||||
<P>
|
||||
When you call <b>pcre2_match()</b>, it makes use of an internal function called
|
||||
<b>match()</b>. This calls itself recursively at branch points in the pattern,
|
||||
in order to remember the state of the match so that it can back up and try a
|
||||
different alternative after a failure. As matching proceeds deeper and deeper
|
||||
into the tree of possibilities, the recursion depth increases. The
|
||||
<b>match()</b> function is also called in other circumstances, for example,
|
||||
whenever a parenthesized sub-pattern is entered, and in certain cases of
|
||||
repetition.
|
||||
</P>
|
||||
<P>
|
||||
Not all calls of <b>match()</b> increase the recursion depth; for an item such
|
||||
as a* it may be called several times at the same level, after matching
|
||||
different numbers of a's. Furthermore, in a number of cases where the result of
|
||||
the recursive call would immediately be passed back as the result of the
|
||||
current call (a "tail recursion"), the function is just restarted instead.
|
||||
</P>
|
||||
<P>
|
||||
The above comments apply when <b>pcre2_match()</b> is run in its normal
|
||||
interpretive manner. If the compiled pattern was processed by
|
||||
<b>pcre2_jit_compile()</b>, and just-in-time compiling was successful, and the
|
||||
options passed to <b>pcre2_match()</b> were not incompatible, the matching
|
||||
process uses the JIT-compiled code instead of the <b>match()</b> function. In
|
||||
this case, the memory requirements are handled entirely differently. See the
|
||||
<a href="pcre2jit.html"><b>pcre2jit</b></a>
|
||||
documentation for details.
|
||||
</P>
|
||||
<P>
|
||||
The <b>pcre2_dfa_match()</b> function operates in a different way to
|
||||
<b>pcre2_match()</b>, and uses recursion only when there is a regular expression
|
||||
recursion or subroutine call in the pattern. This includes the processing of
|
||||
assertion and "once-only" subpatterns, which are handled like subroutine calls.
|
||||
Normally, these are never very deep, and the limit on the complexity of
|
||||
<b>pcre2_dfa_match()</b> is controlled by the amount of workspace it is given.
|
||||
However, it is possible to write patterns with runaway infinite recursions;
|
||||
such patterns will cause <b>pcre2_dfa_match()</b> to run out of stack. At
|
||||
present, there is no protection against this.
|
||||
</P>
|
||||
<P>
|
||||
The comments that follow do NOT apply to <b>pcre2_dfa_match()</b>; they are
|
||||
relevant only for <b>pcre2_match()</b> without the JIT optimization.
|
||||
</P>
|
||||
<br><b>
|
||||
Reducing <b>pcre2_match()</b>'s stack usage
|
||||
</b><br>
|
||||
<P>
|
||||
Each time that the internal <b>match()</b> function is called recursively, it
|
||||
uses memory from the process stack. For certain kinds of pattern and data, very
|
||||
large amounts of stack may be needed, despite the recognition of "tail
|
||||
recursion". You can often reduce the amount of recursion, and therefore the
|
||||
amount of stack used, by modifying the pattern that is being matched. Consider,
|
||||
for example, this pattern:
|
||||
<pre>
|
||||
([^<]|<(?!inet))+
|
||||
</pre>
|
||||
It matches from wherever it starts until it encounters "<inet" or the end of
|
||||
the data, and is the kind of pattern that might be used when processing an XML
|
||||
file. Each iteration of the outer parentheses matches either one character that
|
||||
is not "<" or a "<" that is not followed by "inet". However, each time a
|
||||
parenthesis is processed, a recursion occurs, so this formulation uses a stack
|
||||
frame for each matched character. For a long string, a lot of stack is
|
||||
required. Consider now this rewritten pattern, which matches exactly the same
|
||||
strings:
|
||||
<pre>
|
||||
([^<]++|<(?!inet))+
|
||||
</pre>
|
||||
This uses very much less stack, because runs of characters that do not contain
|
||||
"<" are "swallowed" in one item inside the parentheses. Recursion happens only
|
||||
when a "<" character that is not followed by "inet" is encountered (and we
|
||||
assume this is relatively rare). A possessive quantifier is used to stop any
|
||||
backtracking into the runs of non-"<" characters, but that is not related to
|
||||
stack usage.
|
||||
</P>
|
||||
<P>
|
||||
This example shows that one way of avoiding stack problems when matching long
|
||||
subject strings is to write repeated parenthesized subpatterns to match more
|
||||
than one character whenever possible.
|
||||
</P>
|
||||
<br><b>
|
||||
Compiling PCRE2 to use heap instead of stack for <b>pcre2_match()</b>
|
||||
</b><br>
|
||||
<P>
|
||||
In environments where stack memory is constrained, you might want to compile
|
||||
PCRE2 to use heap memory instead of stack for remembering back-up points when
|
||||
<b>pcre2_match()</b> is running. This makes it run more slowly, however. Details
|
||||
of how to do this are given in the
|
||||
<a href="pcre2build.html"><b>pcre2build</b></a>
|
||||
documentation. When built in this way, instead of using the stack, PCRE2
|
||||
gets memory for remembering backup points from the heap. By default, the memory
|
||||
is obtained by calling the system <b>malloc()</b> function, but you can arrange
|
||||
to supply your own memory management function. For details, see the section
|
||||
entitled
|
||||
<a href="pcre2api.html#matchcontext">"The match context"</a>
|
||||
in the
|
||||
<a href="pcre2api.html"><b>pcre2api</b></a>
|
||||
documentation. Since the block sizes are always the same, it may be possible to
|
||||
implement a customized memory handler that is more efficient than the standard
|
||||
function. The memory blocks obtained for this purpose are retained and re-used
|
||||
if possible while <b>pcre2_match()</b> is running. They are all freed just
|
||||
before it exits.
|
||||
</P>
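<P>
Because every block has the same size, a simple free list is one way to build
such a handler. The sketch below is only an illustration: the names
<b>block_pool</b>, <b>pool_init()</b>, <b>pool_malloc()</b>, <b>pool_free()</b>,
and <b>pool_destroy()</b> are hypothetical, and the two-argument shape (a size
or block plus a user data pointer) is assumed to mirror the custom memory
functions described in the <b>pcre2api</b> documentation. Wiring the functions
into a context is not shown.
</P>

```c
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical pool for equal-sized blocks; freed blocks are kept for reuse. */
typedef struct pool_node { struct pool_node *next; } pool_node;

typedef struct {
  size_t block_size;     /* every block handed out has this size */
  pool_node *free_list;  /* blocks that were freed and can be reused */
} block_pool;

void pool_init(block_pool *pool, size_t block_size)
{
  /* A free block must be able to hold the list link. */
  pool->block_size = block_size < sizeof(pool_node) ? sizeof(pool_node) : block_size;
  pool->free_list = NULL;
}

void *pool_malloc(size_t size, void *data)
{
  block_pool *pool = data;
  if (size > pool->block_size) return NULL;  /* not a block this pool serves */
  if (pool->free_list != NULL) {             /* reuse a retained block */
    pool_node *node = pool->free_list;
    pool->free_list = node->next;
    return node;
  }
  return malloc(pool->block_size);           /* otherwise fall back to malloc() */
}

void pool_free(void *block, void *data)
{
  block_pool *pool = data;
  pool_node *node = block;
  if (block == NULL) return;
  node->next = pool->free_list;              /* retain the block instead of freeing it */
  pool->free_list = node;
}

void pool_destroy(block_pool *pool)
{
  while (pool->free_list != NULL) {          /* release everything still retained */
    pool_node *node = pool->free_list;
    pool->free_list = node->next;
    free(node);
  }
}
```

<P>
The pool never returns memory to the system until <b>pool_destroy()</b> is
called, which trades a higher peak footprint for fewer calls to
<b>malloc()</b>. The sketch is not thread-safe.
</P>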
|
||||
<br><b>
|
||||
Limiting <b>pcre2_match()</b>'s stack usage
|
||||
</b><br>
|
||||
<P>
|
||||
You can set limits on the number of times the internal <b>match()</b> function
|
||||
is called, both in total and recursively. If a limit is exceeded,
|
||||
<b>pcre2_match()</b> returns an error code. Setting suitable limits should
|
||||
prevent it from running out of stack. The default values of the limits are very
|
||||
large, and unlikely ever to operate. They can be changed when PCRE2 is built,
|
||||
and they can also be set when <b>pcre2_match()</b> is called. For details of
|
||||
these interfaces, see the
|
||||
<a href="pcre2build.html"><b>pcre2build</b></a>
|
||||
documentation and the section entitled
|
||||
<a href="pcre2api.html#matchcontext">"The match context"</a>
|
||||
in the
|
||||
<a href="pcre2api.html"><b>pcre2api</b></a>
|
||||
documentation.
|
||||
</P>
|
||||
<P>
|
||||
As a very rough rule of thumb, you should reckon on about 500 bytes per
|
||||
recursion. Thus, if you want to limit your stack usage to 8MB, you should set
|
||||
the limit at 16000 recursions. A 64MB stack, on the other hand, can support
|
||||
around 128000 recursions.
|
||||
</P>
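<P>
The arithmetic behind this rule of thumb can be written down as a small helper.
This is a rough sketch only: <b>recursion_limit_for_stack()</b> is a
hypothetical name, and the 500-byte frame size is the estimate quoted above,
not a measured value.
</P>

```c
#include <stddef.h>
#include <stdint.h>

/* Estimate how many match() recursions fit into stack_bytes, assuming each
   recursion costs about bytes_per_frame bytes (roughly 500 by the rule of
   thumb). Returns 0 if bytes_per_frame is 0. */
uint32_t recursion_limit_for_stack(size_t stack_bytes, size_t bytes_per_frame)
{
  size_t frames;
  if (bytes_per_frame == 0) return 0;
  frames = stack_bytes / bytes_per_frame;
  return frames > UINT32_MAX ? UINT32_MAX : (uint32_t)frames;
}
```

<P>
With 500 bytes per frame, 8 x 1024 x 1024 bytes gives 16777 and
64 x 1024 x 1024 bytes gives 134217, in line with the rounded figures of 16000
and 128000. In practice it is sensible to pass in less than the full stack
size, because the rest of the program needs stack as well.
</P>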
|
||||
<P>
|
||||
The <b>pcre2test</b> test program has a modifier called "find_limits" which, if
|
||||
applied to a subject line, causes it to find the smallest limits that allow a
|
||||
pattern to match. This is done by calling <b>pcre2_match()</b> repeatedly with
|
||||
different limits.
|
||||
</P>
|
||||
<br><b>
|
||||
Changing stack size in Unix-like systems
|
||||
</b><br>
|
||||
<P>
|
||||
In Unix-like environments, there is not often a problem with the stack unless
|
||||
very long strings are involved, though the default limit on stack size varies
|
||||
from system to system. Values from 8MB to 64MB are common. You can find your
|
||||
default limit by running the command:
|
||||
<pre>
|
||||
ulimit -s
|
||||
</pre>
|
||||
Unfortunately, the effect of running out of stack is often SIGSEGV, though
|
||||
sometimes a more explicit error message is given. You can normally increase the
|
||||
limit on stack size by code such as this:
|
||||
<pre>
|
||||
struct rlimit rlim;
|
||||
getrlimit(RLIMIT_STACK, &rlim);
|
||||
rlim.rlim_cur = 100*1024*1024;
|
||||
setrlimit(RLIMIT_STACK, &rlim);
|
||||
</pre>
|
||||
This reads the current limits (soft and hard) using <b>getrlimit()</b>, then
|
||||
attempts to increase the soft limit to 100MB using <b>setrlimit()</b>. You must
|
||||
do this before calling <b>pcre2_match()</b>.
|
||||
</P>
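<P>
A slightly fuller version of the same idea checks the return values and never
asks for more than the hard limit, since a request above the hard limit is
rejected. This is a sketch under those assumptions; the helper name
<b>ensure_stack_limit()</b> is hypothetical.
</P>

```c
#include <sys/resource.h>

/* Make sure the soft stack limit is at least `wanted` bytes, capped at the
   hard limit. Returns 0 if the soft limit is already high enough or was set
   as high as permitted, and -1 if getrlimit() or setrlimit() failed. */
int ensure_stack_limit(rlim_t wanted)
{
  struct rlimit rlim;

  if (getrlimit(RLIMIT_STACK, &rlim) != 0) return -1;

  /* Nothing to do if the soft limit is unlimited or already large enough. */
  if (rlim.rlim_cur == RLIM_INFINITY || rlim.rlim_cur >= wanted) return 0;

  /* The soft limit cannot be raised above the hard limit. */
  if (rlim.rlim_max != RLIM_INFINITY && wanted > rlim.rlim_max)
    wanted = rlim.rlim_max;

  rlim.rlim_cur = wanted;
  return setrlimit(RLIMIT_STACK, &rlim);
}
```

<P>
A call such as ensure_stack_limit((rlim_t)100 * 1024 * 1024) corresponds to the
100MB request above. A return value of 0 does not guarantee that the full
amount was granted when the hard limit is lower than the request.
</P>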
|
||||
<br><b>
|
||||
Changing stack size in Mac OS X
|
||||
</b><br>
|
||||
<P>
|
||||
Using <b>setrlimit()</b>, as described above, should also work on Mac OS X. It
|
||||
is also possible to set a stack size when linking a program. There is a
|
||||
discussion about stack sizes in Mac OS X at this web site:
|
||||
<a href="http://developer.apple.com/qa/qa2005/qa1419.html">http://developer.apple.com/qa/qa2005/qa1419.html</a>.
|
||||
</P>
|
||||
<br><b>
|
||||
AUTHOR
|
||||
</b><br>
|
||||
<P>
|
||||
Philip Hazel
|
||||
<br>
|
||||
University Computing Service
|
||||
<br>
|
||||
Cambridge CB2 3QH, England.
|
||||
<br>
|
||||
</P>
|
||||
<br><b>
|
||||
REVISION
|
||||
</b><br>
|
||||
<P>
|
||||
Last updated: 20 October 2014
|
||||
<br>
|
||||
Copyright © 1997-2014 University of Cambridge.
|
||||
<br>
|
||||
<p>
|
||||
Return to the <a href="index.html">PCRE2 index page</a>.
|
||||
</p>
|
561
doc/html/pcre2syntax.html
Normal file
@ -0,0 +1,561 @@
|
||||
<html>
|
||||
<head>
|
||||
<title>pcre2syntax specification</title>
|
||||
</head>
|
||||
<body bgcolor="#FFFFFF" text="#00005A" link="#0066FF" alink="#3399FF" vlink="#2222BB">
|
||||
<h1>pcre2syntax man page</h1>
|
||||
<p>
|
||||
Return to the <a href="index.html">PCRE2 index page</a>.
|
||||
</p>
|
||||
<p>
|
||||
This page is part of the PCRE2 HTML documentation. It was generated
|
||||
automatically from the original man page. If there is any nonsense in it,
|
||||
please consult the man page, in case the conversion went wrong.
|
||||
<br>
|
||||
<ul>
|
||||
<li><a name="TOC1" href="#SEC1">PCRE2 REGULAR EXPRESSION SYNTAX SUMMARY</a>
|
||||
<li><a name="TOC2" href="#SEC2">QUOTING</a>
|
||||
<li><a name="TOC3" href="#SEC3">CHARACTERS</a>
|
||||
<li><a name="TOC4" href="#SEC4">CHARACTER TYPES</a>
|
||||
<li><a name="TOC5" href="#SEC5">GENERAL CATEGORY PROPERTIES FOR \p and \P</a>
|
||||
<li><a name="TOC6" href="#SEC6">PCRE2 SPECIAL CATEGORY PROPERTIES FOR \p and \P</a>
|
||||
<li><a name="TOC7" href="#SEC7">SCRIPT NAMES FOR \p AND \P</a>
|
||||
<li><a name="TOC8" href="#SEC8">CHARACTER CLASSES</a>
|
||||
<li><a name="TOC9" href="#SEC9">QUANTIFIERS</a>
|
||||
<li><a name="TOC10" href="#SEC10">ANCHORS AND SIMPLE ASSERTIONS</a>
|
||||
<li><a name="TOC11" href="#SEC11">MATCH POINT RESET</a>
|
||||
<li><a name="TOC12" href="#SEC12">ALTERNATION</a>
|
||||
<li><a name="TOC13" href="#SEC13">CAPTURING</a>
|
||||
<li><a name="TOC14" href="#SEC14">ATOMIC GROUPS</a>
|
||||
<li><a name="TOC15" href="#SEC15">COMMENT</a>
|
||||
<li><a name="TOC16" href="#SEC16">OPTION SETTING</a>
|
||||
<li><a name="TOC17" href="#SEC17">NEWLINE CONVENTION</a>
|
||||
<li><a name="TOC18" href="#SEC18">WHAT \R MATCHES</a>
|
||||
<li><a name="TOC19" href="#SEC19">LOOKAHEAD AND LOOKBEHIND ASSERTIONS</a>
|
||||
<li><a name="TOC20" href="#SEC20">BACKREFERENCES</a>
|
||||
<li><a name="TOC21" href="#SEC21">SUBROUTINE REFERENCES (POSSIBLY RECURSIVE)</a>
|
||||
<li><a name="TOC22" href="#SEC22">CONDITIONAL PATTERNS</a>
|
||||
<li><a name="TOC23" href="#SEC23">BACKTRACKING CONTROL</a>
|
||||
<li><a name="TOC24" href="#SEC24">CALLOUTS</a>
|
||||
<li><a name="TOC25" href="#SEC25">SEE ALSO</a>
|
||||
<li><a name="TOC26" href="#SEC26">AUTHOR</a>
|
||||
<li><a name="TOC27" href="#SEC27">REVISION</a>
|
||||
</ul>
|
||||
<br><a name="SEC1" href="#TOC1">PCRE2 REGULAR EXPRESSION SYNTAX SUMMARY</a><br>
|
||||
<P>
|
||||
The full syntax and semantics of the regular expressions that are supported by
|
||||
PCRE2 are described in the
|
||||
<a href="pcre2pattern.html"><b>pcre2pattern</b></a>
|
||||
documentation. This document contains a quick-reference summary of the syntax.
|
||||
</P>
|
||||
<br><a name="SEC2" href="#TOC1">QUOTING</a><br>
|
||||
<P>
|
||||
<pre>
|
||||
\x where x is non-alphanumeric is a literal x
|
||||
\Q...\E treat enclosed characters as literal
|
||||
</PRE>
|
||||
</P>
|
||||
<br><a name="SEC3" href="#TOC1">CHARACTERS</a><br>
|
||||
<P>
|
||||
<pre>
|
||||
\a alarm, that is, the BEL character (hex 07)
|
||||
\cx "control-x", where x is any ASCII character
|
||||
\e escape (hex 1B)
|
||||
\f form feed (hex 0C)
|
||||
\n newline (hex 0A)
|
||||
\r carriage return (hex 0D)
|
||||
\t tab (hex 09)
|
||||
\0dd character with octal code 0dd
|
||||
\ddd character with octal code ddd, or backreference
|
||||
\o{ddd..} character with octal code ddd..
|
||||
\xhh character with hex code hh
|
||||
\x{hhh..} character with hex code hhh..
|
||||
</pre>
|
||||
Note that \0dd is always an octal code, and that \8 and \9 are the literal
|
||||
characters "8" and "9".
|
||||
</P>
|
||||
<br><a name="SEC4" href="#TOC1">CHARACTER TYPES</a><br>
|
||||
<P>
|
||||
<pre>
|
||||
. any character except newline;
|
||||
in dotall mode, any character whatsoever
|
||||
\C one data unit, even in UTF mode (best avoided)
|
||||
\d a decimal digit
|
||||
\D a character that is not a decimal digit
|
||||
\h a horizontal white space character
|
||||
\H a character that is not a horizontal white space character
|
||||
\N a character that is not a newline
|
||||
\p{<i>xx</i>} a character with the <i>xx</i> property
|
||||
\P{<i>xx</i>} a character without the <i>xx</i> property
|
||||
\R a newline sequence
|
||||
\s a white space character
|
||||
\S a character that is not a white space character
|
||||
\v a vertical white space character
|
||||
\V a character that is not a vertical white space character
|
||||
\w a "word" character
|
||||
\W a "non-word" character
|
||||
\X a Unicode extended grapheme cluster
|
||||
</pre>
|
||||
By default, \d, \s, and \w match only ASCII characters, even in UTF-8 mode
|
||||
or in the 16-bit and 32-bit libraries. However, if locale-specific matching is
|
||||
happening, \s and \w may also match characters with code points in the range
|
||||
128-255. If the PCRE2_UCP option is set, the behaviour of these escape
|
||||
sequences is changed to use Unicode properties and they match many more
|
||||
characters.
|
||||
</P>
|
||||
<br><a name="SEC5" href="#TOC1">GENERAL CATEGORY PROPERTIES FOR \p and \P</a><br>
|
||||
<P>
|
||||
<pre>
|
||||
C Other
|
||||
Cc Control
|
||||
Cf Format
|
||||
Cn Unassigned
|
||||
Co Private use
|
||||
Cs Surrogate
|
||||
|
||||
L Letter
|
||||
Ll Lower case letter
|
||||
Lm Modifier letter
|
||||
Lo Other letter
|
||||
Lt Title case letter
|
||||
Lu Upper case letter
|
||||
L& Ll, Lu, or Lt
|
||||
|
||||
M Mark
|
||||
Mc Spacing mark
|
||||
Me Enclosing mark
|
||||
Mn Non-spacing mark
|
||||
|
||||
N Number
|
||||
Nd Decimal number
|
||||
Nl Letter number
|
||||
No Other number
|
||||
|
||||
P Punctuation
|
||||
Pc Connector punctuation
|
||||
Pd Dash punctuation
|
||||
Pe Close punctuation
|
||||
Pf Final punctuation
|
||||
Pi Initial punctuation
|
||||
Po Other punctuation
|
||||
Ps Open punctuation
|
||||
|
||||
S Symbol
|
||||
Sc Currency symbol
|
||||
Sk Modifier symbol
|
||||
Sm Mathematical symbol
|
||||
So Other symbol
|
||||
|
||||
Z Separator
|
||||
Zl Line separator
|
||||
Zp Paragraph separator
|
||||
Zs Space separator
|
||||
</PRE>
|
||||
</P>
|
||||
<br><a name="SEC6" href="#TOC1">PCRE2 SPECIAL CATEGORY PROPERTIES FOR \p and \P</a><br>
|
||||
<P>
|
||||
<pre>
|
||||
Xan Alphanumeric: union of properties L and N
|
||||
Xps POSIX space: property Z or tab, NL, VT, FF, CR
|
||||
Xsp Perl space: property Z or tab, NL, VT, FF, CR
|
||||
Xuc Universally-named character: one that can be
|
||||
represented by a Universal Character Name
|
||||
Xwd Perl word: property Xan or underscore
|
||||
</pre>
|
||||
Perl and POSIX space are now the same. Perl added VT to its space character set
|
||||
at release 5.18.
|
||||
</P>
|
||||
<br><a name="SEC7" href="#TOC1">SCRIPT NAMES FOR \p AND \P</a><br>
|
||||
<P>
|
||||
Arabic,
|
||||
Armenian,
|
||||
Avestan,
|
||||
Balinese,
|
||||
Bamum,
|
||||
Bassa_Vah,
|
||||
Batak,
|
||||
Bengali,
|
||||
Bopomofo,
|
||||
Brahmi,
|
||||
Braille,
|
||||
Buginese,
|
||||
Buhid,
|
||||
Canadian_Aboriginal,
|
||||
Carian,
|
||||
Caucasian_Albanian,
|
||||
Chakma,
|
||||
Cham,
|
||||
Cherokee,
|
||||
Common,
|
||||
Coptic,
|
||||
Cuneiform,
|
||||
Cypriot,
|
||||
Cyrillic,
|
||||
Deseret,
|
||||
Devanagari,
|
||||
Duployan,
|
||||
Egyptian_Hieroglyphs,
|
||||
Elbasan,
|
||||
Ethiopic,
|
||||
Georgian,
|
||||
Glagolitic,
|
||||
Gothic,
|
||||
Grantha,
|
||||
Greek,
|
||||
Gujarati,
|
||||
Gurmukhi,
|
||||
Han,
|
||||
Hangul,
|
||||
Hanunoo,
|
||||
Hebrew,
|
||||
Hiragana,
|
||||
Imperial_Aramaic,
|
||||
Inherited,
|
||||
Inscriptional_Pahlavi,
|
||||
Inscriptional_Parthian,
|
||||
Javanese,
|
||||
Kaithi,
|
||||
Kannada,
|
||||
Katakana,
|
||||
Kayah_Li,
|
||||
Kharoshthi,
|
||||
Khmer,
|
||||
Khojki,
|
||||
Khudawadi,
|
||||
Lao,
|
||||
Latin,
|
||||
Lepcha,
|
||||
Limbu,
|
||||
Linear_A,
|
||||
Linear_B,
|
||||
Lisu,
|
||||
Lycian,
|
||||
Lydian,
|
||||
Mahajani,
|
||||
Malayalam,
|
||||
Mandaic,
|
||||
Manichaean,
|
||||
Meetei_Mayek,
|
||||
Mende_Kikakui,
|
||||
Meroitic_Cursive,
|
||||
Meroitic_Hieroglyphs,
|
||||
Miao,
|
||||
Modi,
|
||||
Mongolian,
|
||||
Mro,
|
||||
Myanmar,
|
||||
Nabataean,
|
||||
New_Tai_Lue,
|
||||
Nko,
|
||||
Ogham,
|
||||
Ol_Chiki,
|
||||
Old_Italic,
|
||||
Old_North_Arabian,
|
||||
Old_Permic,
|
||||
Old_Persian,
|
||||
Old_South_Arabian,
|
||||
Old_Turkic,
|
||||
Oriya,
|
||||
Osmanya,
|
||||
Pahawh_Hmong,
|
||||
Palmyrene,
|
||||
Pau_Cin_Hau,
|
||||
Phags_Pa,
|
||||
Phoenician,
|
||||
Psalter_Pahlavi,
|
||||
Rejang,
|
||||
Runic,
|
||||
Samaritan,
|
||||
Saurashtra,
|
||||
Sharada,
|
||||
Shavian,
|
||||
Siddham,
|
||||
Sinhala,
|
||||
Sora_Sompeng,
|
||||
Sundanese,
|
||||
Syloti_Nagri,
|
||||
Syriac,
|
||||
Tagalog,
|
||||
Tagbanwa,
|
||||
Tai_Le,
|
||||
Tai_Tham,
|
||||
Tai_Viet,
|
||||
Takri,
|
||||
Tamil,
|
||||
Telugu,
|
||||
Thaana,
|
||||
Thai,
|
||||
Tibetan,
|
||||
Tifinagh,
|
||||
Tirhuta,
|
||||
Ugaritic,
|
||||
Vai,
|
||||
Warang_Citi,
|
||||
Yi.
|
||||
</P>
|
||||
<br><a name="SEC8" href="#TOC1">CHARACTER CLASSES</a><br>
|
||||
<P>
|
||||
<pre>
|
||||
[...] positive character class
|
||||
[^...] negative character class
|
||||
[x-y] range (can be used for hex characters)
|
||||
[[:xxx:]] positive POSIX named set
|
||||
[[:^xxx:]] negative POSIX named set
|
||||
|
||||
alnum alphanumeric
|
||||
alpha alphabetic
|
||||
ascii 0-127
|
||||
blank space or tab
|
||||
cntrl control character
|
||||
digit decimal digit
|
||||
graph printing, excluding space
|
||||
lower lower case letter
|
||||
print printing, including space
|
||||
punct printing, excluding alphanumeric
|
||||
space white space
|
||||
upper upper case letter
|
||||
word same as \w
|
||||
xdigit hexadecimal digit
|
||||
</pre>
|
||||
In PCRE2, POSIX character set names recognize only ASCII characters by default,
|
||||
but some of them use Unicode properties if PCRE2_UCP is set. You can use
|
||||
\Q...\E inside a character class.
|
||||
</P>
|
||||
<br><a name="SEC9" href="#TOC1">QUANTIFIERS</a><br>
|
||||
<P>
|
||||
<pre>
|
||||
? 0 or 1, greedy
|
||||
?+ 0 or 1, possessive
|
||||
?? 0 or 1, lazy
|
||||
* 0 or more, greedy
|
||||
*+ 0 or more, possessive
|
||||
*? 0 or more, lazy
|
||||
+ 1 or more, greedy
|
||||
++ 1 or more, possessive
|
||||
+? 1 or more, lazy
|
||||
{n} exactly n
|
||||
{n,m} at least n, no more than m, greedy
|
||||
{n,m}+ at least n, no more than m, possessive
|
||||
{n,m}? at least n, no more than m, lazy
|
||||
{n,} n or more, greedy
|
||||
{n,}+ n or more, possessive
|
||||
{n,}? n or more, lazy
|
||||
</PRE>
|
||||
</P>
|
||||
<br><a name="SEC10" href="#TOC1">ANCHORS AND SIMPLE ASSERTIONS</a><br>
|
||||
<P>
|
||||
<pre>
|
||||
\b word boundary
|
||||
\B not a word boundary
|
||||
^ start of subject
|
||||
also after internal newline in multiline mode
|
||||
\A start of subject
|
||||
$ end of subject
|
||||
also before newline at end of subject
|
||||
also before internal newline in multiline mode
|
||||
\Z end of subject
|
||||
also before newline at end of subject
|
||||
\z end of subject
|
||||
\G first matching position in subject
|
||||
</PRE>
|
||||
</P>
|
||||
<br><a name="SEC11" href="#TOC1">MATCH POINT RESET</a><br>
|
||||
<P>
|
||||
<pre>
|
||||
\K reset start of match
|
||||
</pre>
|
||||
\K is honoured in positive assertions, but ignored in negative ones.
|
||||
</P>
|
||||
<br><a name="SEC12" href="#TOC1">ALTERNATION</a><br>
|
||||
<P>
|
||||
<pre>
|
||||
expr|expr|expr...
|
||||
</PRE>
|
||||
</P>
|
||||
<br><a name="SEC13" href="#TOC1">CAPTURING</a><br>
|
||||
<P>
|
||||
<pre>
|
||||
(...) capturing group
|
||||
(?<name>...) named capturing group (Perl)
|
||||
(?'name'...) named capturing group (Perl)
|
||||
(?P<name>...) named capturing group (Python)
|
||||
(?:...) non-capturing group
|
||||
(?|...) non-capturing group; reset group numbers for
|
||||
capturing groups in each alternative
|
||||
</PRE>
|
||||
</P>
|
||||
<br><a name="SEC14" href="#TOC1">ATOMIC GROUPS</a><br>
|
||||
<P>
|
||||
<pre>
|
||||
(?>...) atomic, non-capturing group
|
||||
</PRE>
|
||||
</P>
|
||||
<br><a name="SEC15" href="#TOC1">COMMENT</a><br>
|
||||
<P>
|
||||
<pre>
|
||||
(?#....) comment (not nestable)
|
||||
</PRE>
|
||||
</P>
|
||||
<br><a name="SEC16" href="#TOC1">OPTION SETTING</a><br>
|
||||
<P>
|
||||
<pre>
|
||||
(?i) caseless
|
||||
(?J) allow duplicate names
|
||||
(?m) multiline
|
||||
(?s) single line (dotall)
|
||||
(?U) default ungreedy (lazy)
|
||||
(?x) extended (ignore white space)
|
||||
(?-...) unset option(s)
|
||||
</pre>
|
||||
The following are recognized only at the very start of a pattern or after one
|
||||
of the newline or \R options with similar syntax. More than one of them may
|
||||
appear.
|
||||
<pre>
|
||||
(*LIMIT_MATCH=d) set the match limit to d (decimal number)
|
||||
(*LIMIT_RECURSION=d) set the recursion limit to d (decimal number)
|
||||
(*NOTEMPTY) set PCRE2_NOTEMPTY when matching
|
||||
(*NOTEMPTY_ATSTART) set PCRE2_NOTEMPTY_ATSTART when matching
|
||||
(*NO_AUTO_POSSESS) no auto-possessification (PCRE2_NO_AUTO_POSSESS)
|
||||
(*NO_START_OPT) no start-match optimization (PCRE2_NO_START_OPTIMIZE)
|
||||
(*UTF) set appropriate UTF mode for the library in use
|
||||
(*UCP) set PCRE2_UCP (use Unicode properties for \d etc)
|
||||
</pre>
|
||||
Note that LIMIT_MATCH and LIMIT_RECURSION can only reduce the value of the
|
||||
limits set by the caller of pcre2_match(), not increase them.
|
||||
</P>
|
||||
<br><a name="SEC17" href="#TOC1">NEWLINE CONVENTION</a><br>
|
||||
<P>
|
||||
These are recognized only at the very start of the pattern or after option
|
||||
settings with a similar syntax.
|
||||
<pre>
|
||||
(*CR) carriage return only
|
||||
(*LF) linefeed only
|
||||
(*CRLF) carriage return followed by linefeed
|
||||
(*ANYCRLF) all three of the above
|
||||
(*ANY) any Unicode newline sequence
|
||||
</PRE>
|
||||
</P>
|
||||
<br><a name="SEC18" href="#TOC1">WHAT \R MATCHES</a><br>
|
||||
<P>
|
||||
These are recognized only at the very start of the pattern or after option
|
||||
settings with a similar syntax.
|
||||
<pre>
|
||||
(*BSR_ANYCRLF) CR, LF, or CRLF
|
||||
(*BSR_UNICODE) any Unicode newline sequence
|
||||
</PRE>
|
||||
</P>
|
||||
<br><a name="SEC19" href="#TOC1">LOOKAHEAD AND LOOKBEHIND ASSERTIONS</a><br>
|
||||
<P>
|
||||
<pre>
|
||||
(?=...) positive look ahead
|
||||
(?!...) negative look ahead
|
||||
(?<=...) positive look behind
|
||||
(?<!...) negative look behind
|
||||
</pre>
|
||||
Each top-level branch of a look behind must be of a fixed length.
|
||||
</P>
|
||||
<br><a name="SEC20" href="#TOC1">BACKREFERENCES</a><br>
|
||||
<P>
|
||||
<pre>
|
||||
\n reference by number (can be ambiguous)
|
||||
\gn reference by number
|
||||
\g{n} reference by number
|
||||
\g{-n} relative reference by number
|
||||
\k<name> reference by name (Perl)
|
||||
\k'name' reference by name (Perl)
|
||||
\g{name} reference by name (Perl)
|
||||
\k{name} reference by name (.NET)
|
||||
(?P=name) reference by name (Python)
|
||||
</PRE>
|
||||
</P>
|
||||
<br><a name="SEC21" href="#TOC1">SUBROUTINE REFERENCES (POSSIBLY RECURSIVE)</a><br>
|
||||
<P>
|
||||
<pre>
|
||||
(?R) recurse whole pattern
|
||||
(?n) call subpattern by absolute number
|
||||
(?+n) call subpattern by relative number
|
||||
(?-n) call subpattern by relative number
|
||||
(?&name) call subpattern by name (Perl)
|
||||
(?P>name) call subpattern by name (Python)
|
||||
\g<name> call subpattern by name (Oniguruma)
|
||||
\g'name' call subpattern by name (Oniguruma)
|
||||
\g<n> call subpattern by absolute number (Oniguruma)
|
||||
\g'n' call subpattern by absolute number (Oniguruma)
|
||||
\g<+n> call subpattern by relative number (PCRE2 extension)
|
||||
\g'+n' call subpattern by relative number (PCRE2 extension)
|
||||
\g<-n> call subpattern by relative number (PCRE2 extension)
|
||||
\g'-n' call subpattern by relative number (PCRE2 extension)
|
||||
</PRE>
|
||||
</P>
|
||||
<br><a name="SEC22" href="#TOC1">CONDITIONAL PATTERNS</a><br>
|
||||
<P>
|
||||
<pre>
|
||||
(?(condition)yes-pattern)
|
||||
(?(condition)yes-pattern|no-pattern)
|
||||
|
||||
(?(n)... absolute reference condition
|
||||
(?(+n)... relative reference condition
|
||||
(?(-n)... relative reference condition
|
||||
(?(<name>)... named reference condition (Perl)
|
||||
(?('name')... named reference condition (Perl)
|
||||
(?(name)... named reference condition (PCRE2)
|
||||
(?(R)... overall recursion condition
|
||||
(?(Rn)... specific group recursion condition
|
||||
(?(R&name)... specific recursion condition
|
||||
(?(DEFINE)... define subpattern for reference
|
||||
(?(assert)... assertion condition
|
||||
</PRE>
|
||||
</P>
|
||||
<br><a name="SEC23" href="#TOC1">BACKTRACKING CONTROL</a><br>
|
||||
<P>
|
||||
The following act as soon as they are reached:
|
||||
<pre>
|
||||
(*ACCEPT) force successful match
|
||||
(*FAIL) force backtrack; synonym (*F)
|
||||
(*MARK:NAME) set name to be passed back; synonym (*:NAME)
|
||||
</pre>
|
||||
The following act only when a subsequent match failure causes a backtrack to
|
||||
reach them. They all force a match failure, but they differ in what happens
|
||||
afterwards. Those that advance the start-of-match point do so only if the
|
||||
pattern is not anchored.
|
||||
<pre>
|
||||
(*COMMIT) overall failure, no advance of starting point
|
||||
(*PRUNE) advance to next starting character
|
||||
(*PRUNE:NAME) equivalent to (*MARK:NAME)(*PRUNE)
|
||||
(*SKIP) advance to current matching position
|
||||
(*SKIP:NAME) advance to position corresponding to an earlier
|
||||
(*MARK:NAME); if not found, the (*SKIP) is ignored
|
||||
(*THEN) local failure, backtrack to next alternation
|
||||
(*THEN:NAME) equivalent to (*MARK:NAME)(*THEN)
|
||||
</PRE>
|
||||
</P>
|
||||
<br><a name="SEC24" href="#TOC1">CALLOUTS</a><br>
|
||||
<P>
|
||||
<pre>
|
||||
(?C) callout
|
||||
(?Cn) callout with data n
|
||||
</PRE>
|
||||
</P>
|
||||
<br><a name="SEC25" href="#TOC1">SEE ALSO</a><br>
|
||||
<P>
|
||||
<b>pcre2pattern</b>(3), <b>pcre2api</b>(3), <b>pcre2callout</b>(3),
|
||||
<b>pcre2matching</b>(3), <b>pcre2</b>(3).
|
||||
</P>
|
||||
<br><a name="SEC26" href="#TOC1">AUTHOR</a><br>
|
||||
<P>
|
||||
Philip Hazel
|
||||
<br>
|
||||
University Computing Service
|
||||
<br>
|
||||
Cambridge CB2 3QH, England.
|
||||
<br>
|
||||
</P>
|
||||
<br><a name="SEC27" href="#TOC1">REVISION</a><br>
|
||||
<P>
|
||||
Last updated: 20 October 2014
|
||||
<br>
|
||||
Copyright © 1997-2014 University of Cambridge.
|
||||
<br>
|
||||
<p>
|
||||
Return to the <a href="index.html">PCRE2 index page</a>.
|
||||
</p>
|
178
doc/pcre2perform.3
Normal file
@ -0,0 +1,178 @@
|
||||
.TH PCRE2PERFORM 3 "20 October 2014" "PCRE2 10.00"
|
||||
.SH NAME
|
||||
PCRE2 - Perl-compatible regular expressions (revised API)
|
||||
.SH "PCRE2 PERFORMANCE"
|
||||
.rs
|
||||
.sp
|
||||
Two aspects of performance are discussed below: memory usage and processing
|
||||
time. The way you express your pattern as a regular expression can affect both
|
||||
of them.
|
||||
.
|
||||
.SH "COMPILED PATTERN MEMORY USAGE"
|
||||
.rs
|
||||
.sp
|
||||
Patterns are compiled by PCRE2 into a reasonably efficient interpretive code,
|
||||
so that most simple patterns do not use much memory. However, there is one case
|
||||
where the memory usage of a compiled pattern can be unexpectedly large. If a
|
||||
parenthesized subpattern has a quantifier with a minimum greater than 1 and/or
|
||||
a limited maximum, the whole subpattern is repeated in the compiled code. For
|
||||
example, the pattern
|
||||
.sp
|
||||
(abc|def){2,4}
|
||||
.sp
|
||||
is compiled as if it were
|
||||
.sp
|
||||
(abc|def)(abc|def)((abc|def)(abc|def)?)?
|
||||
.sp
|
||||
(Technical aside: It is done this way so that backtrack points within each of
|
||||
the repetitions can be independently maintained.)
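.P
The expansion described above can be reproduced textually, which makes it easy
to see how quickly the compiled form grows. The following is an illustrative
sketch only: expand_repeat() and append() are hypothetical helpers that build
the equivalent pattern text, and they do not reflect how PCRE2 represents the
pattern internally.

```c
#include <stddef.h>
#include <string.h>

/* Append text to the NUL-terminated string in out; -1 if it does not fit. */
static int append(char *out, size_t out_size, const char *text)
{
  size_t used = strlen(out), extra = strlen(text);
  if (used + extra + 1 > out_size) return -1;
  memcpy(out + used, text, extra + 1);
  return 0;
}

/* Write the textual expansion of group{min,max} into out. The group string
   must include its own parentheses. Returns 0 on success, -1 on a bad range
   or if out is too small. */
int expand_repeat(const char *group, unsigned min, unsigned max,
                  char *out, size_t out_size)
{
  unsigned optional, i;
  if (out_size == 0 || max < min || max == 0) return -1;
  out[0] = '\0';
  optional = max - min;

  for (i = 0; i < min; i++)                  /* mandatory copies */
    if (append(out, out_size, group) != 0) return -1;

  for (i = 0; i + 1 < optional; i++)         /* open the nested optional groups */
    if (append(out, out_size, "(") != 0 ||
        append(out, out_size, group) != 0) return -1;

  if (optional > 0)                          /* innermost optional copy */
    if (append(out, out_size, group) != 0 ||
        append(out, out_size, "?") != 0) return -1;

  for (i = 0; i + 1 < optional; i++)         /* close the nesting */
    if (append(out, out_size, ")?") != 0) return -1;

  return 0;
}
```

.P
Called with the group (abc|def), a minimum of 2 and a maximum of 4, the helper
produces (abc|def)(abc|def)((abc|def)(abc|def)?)?, the same text as shown
above. The output length grows with the maximum, and nesting such repeats
multiplies the growth.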
|
||||
.P
|
||||
For regular expressions whose quantifiers use only small numbers, this is not
|
||||
usually a problem. However, if the numbers are large, and particularly if such
|
||||
repetitions are nested, the memory usage can become an embarrassment. For
|
||||
example, the very simple pattern
|
||||
.sp
|
||||
((ab){1,1000}c){1,3}
|
||||
.sp
|
||||
uses 51K bytes when compiled using the 8-bit library. When PCRE2 is compiled
|
||||
with its default internal pointer size of two bytes, the size limit on a
|
||||
compiled pattern is 64K code units in the 8-bit and 16-bit libraries, and this
|
||||
is reached with the above pattern if the outer repetition is increased from 3
|
||||
to 4. PCRE2 can be compiled to use larger internal pointers and thus handle
|
||||
larger compiled patterns, but it is better to try to rewrite your pattern to
|
||||
use less memory if you can.
|
||||
.P
|
||||
One way of reducing the memory usage for such patterns is to make use of
|
||||
PCRE2's
|
||||
.\" HTML <a href="pcre2pattern.html#subpatternsassubroutines">
|
||||
.\" </a>
|
||||
"subroutine"
|
||||
.\"
|
||||
facility. Re-writing the above pattern as
|
||||
.sp
|
||||
((ab)(?2){0,999}c)(?1){0,2}
|
||||
.sp
|
||||
reduces the memory requirements to 18K, and indeed it remains under 20K even
|
||||
with the outer repetition increased to 100. However, this pattern is not
|
||||
exactly equivalent, because the "subroutine" calls are treated as
|
||||
.\" HTML <a href="pcre2pattern.html#atomicgroup">
|
||||
.\" </a>
|
||||
atomic groups
|
||||
.\"
|
||||
into which there can be no backtracking if there is a subsequent matching
|
||||
failure. Therefore, PCRE2 cannot do this kind of rewriting automatically.
|
||||
Furthermore, there is a noticeable loss of speed when executing the modified
|
||||
pattern. Nevertheless, if the atomic grouping is not a problem and the loss of
|
||||
speed is acceptable, this kind of rewriting will allow you to process patterns
|
||||
that PCRE2 cannot otherwise handle.
|
||||
.
|
||||
.
|
||||
.SH "STACK USAGE AT RUN TIME"
|
||||
.rs
|
||||
.sp
|
||||
When \fBpcre2_match()\fP is used for matching, certain kinds of pattern can
|
||||
cause it to use large amounts of the process stack. In some environments the
|
||||
default process stack is quite small, and if it runs out the result is often
|
||||
SIGSEGV. Rewriting your pattern can often help. The
|
||||
.\" HREF
|
||||
\fBpcre2stack\fP
|
||||
.\"
|
||||
documentation discusses this issue in detail.
|
||||
.
|
||||
.
|
||||
.SH "PROCESSING TIME"
|
||||
.rs
|
||||
.sp
|
||||
Certain items in regular expression patterns are processed more efficiently
|
||||
than others. It is more efficient to use a character class like [aeiou] than a
|
||||
set of single-character alternatives such as (a|e|i|o|u). In general, the
|
||||
simplest construction that provides the required behaviour is usually the most
|
||||
efficient. Jeffrey Friedl's book contains a lot of useful general discussion
|
||||
about optimizing regular expressions for efficient performance. This document
|
||||
contains a few observations about PCRE2.
|
||||
.P
|
||||
Using Unicode character properties (the \ep, \eP, and \eX escapes) is slow,
|
||||
because PCRE2 has to use a multi-stage table lookup whenever it needs a
|
||||
character's property. If you can find an alternative pattern that does not use
|
||||
character properties, it will probably be faster.
|
||||
.P
|
||||
By default, the escape sequences \eb, \ed, \es, and \ew, and the POSIX
|
||||
character classes such as [:alpha:] do not use Unicode properties, partly for
|
||||
backwards compatibility, and partly for performance reasons. However, you can
|
||||
set the PCRE2_UCP option or start the pattern with (*UCP) if you want Unicode
|
||||
character properties to be used. This can double the matching time for items
|
||||
such as \ed, when matched with \fBpcre2_match()\fP; the performance loss is
|
||||
less with a DFA matching function, and in both cases there is not much
|
||||
difference for \eb.
|
||||
.P
|
||||
When a pattern begins with .* not in parentheses, or in parentheses that are
|
||||
not the subject of a backreference, and the PCRE2_DOTALL option is set, the
|
||||
pattern is implicitly anchored by PCRE2, since it can match only at the start
|
||||
of a subject string. However, if PCRE2_DOTALL is not set, PCRE2 cannot make
|
||||
this optimization, because the dot metacharacter does not then match a newline,
|
||||
and if the subject string contains newlines, the pattern may match from the
|
||||
character immediately following one of them instead of from the very start. For
|
||||
example, the pattern
|
||||
.sp
|
||||
.*second
|
||||
.sp
|
||||
matches the subject "first\enand second" (where \en stands for a newline
|
||||
character), with the match starting at the seventh character. In order to do
|
||||
this, PCRE2 has to retry the match starting after every newline in the subject.
|
||||
.P
|
||||
If you are using such a pattern with subject strings that do not contain
|
||||
newlines, the best performance is obtained by setting PCRE2_DOTALL, or starting
|
||||
the pattern with ^.* or ^.*? to indicate explicit anchoring. That saves PCRE2
|
||||
from having to scan along the subject looking for a newline to restart at.
|
||||
.P
|
||||
Beware of patterns that contain nested indefinite repeats. These can take a
|
||||
long time to run when applied to a string that does not match. Consider the
|
||||
pattern fragment
|
||||
.sp
|
||||
^(a+)*
|
||||
.sp
|
||||
This can match "aaaa" in 16 different ways, and this number increases very
|
||||
rapidly as the string gets longer. (The * repeat can match 0, 1, 2, 3, or 4
|
||||
times, and for each of those cases other than 0 or 4, the + repeats can match
|
||||
different numbers of times.) When the remainder of the pattern is such that the
|
||||
entire match is going to fail, PCRE2 has in principle to try every possible
|
||||
variation, and this can take an extremely long time, even for relatively short
|
||||
strings.
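.P
The count of 16 quoted above can be checked with a short calculation. The C
sketch below is illustrative only: the name count_ways is hypothetical and
nothing in it is part of PCRE2. It counts the ways in which ^(a+)* can match
some prefix of a run of n "a" characters, on the assumption that either the
group repeats zero times, or its first iteration takes k characters and the
remainder is matched in the same way.

```c
#include <stddef.h>

/* Number of distinct ways ^(a+)* can match a prefix of n "a" characters.
   ways[i] = 1 (zero iterations of the group) plus, for each length k of the
   first iteration, the number of ways to continue on the remaining i - k
   characters. Values of n above 62 are clamped so the result fits in an
   unsigned long long. */
unsigned long long count_ways(unsigned int n)
{
    unsigned long long ways[63];
    unsigned int i, k;

    if (n > 62) n = 62;
    ways[0] = 1;
    for (i = 1; i <= n; i++) {
        ways[i] = 1;
        for (k = 1; k <= i; k++)
            ways[i] += ways[i - k];
    }
    return ways[n];
}
```

Under this model count_ways(4) is 16, and the value doubles with each extra
character, which is why a failing match can take so long.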
|
||||
.P
|
||||
An optimization catches some of the simpler cases such as
|
||||
.sp
|
||||
(a+)*b
|
||||
.sp
|
||||
where a literal character follows. Before embarking on the standard matching
|
||||
procedure, PCRE2 checks that there is a "b" later in the subject string, and if
|
||||
there is not, it fails the match immediately. However, when there is no
|
||||
following literal this optimization cannot be used. You can see the difference
|
||||
by comparing the behaviour of
|
||||
.sp
|
||||
(a+)*\ed
|
||||
.sp
|
||||
with the pattern above. The pattern above, which ends in the literal "b", gives
a failure almost instantly when applied to a whole line of "a" characters,
whereas the \ed version takes an appreciable time with strings longer than
about 20 characters.
|
||||
.P
|
||||
In many cases, the solution to this kind of performance issue is to use an
|
||||
atomic group or a possessive quantifier.
|
||||
.
|
||||
.
|
||||
.SH AUTHOR
|
||||
.rs
|
||||
.sp
|
||||
.nf
|
||||
Philip Hazel
|
||||
University Computing Service
|
||||
Cambridge CB2 3QH, England.
|
||||
.fi
|
||||
.
|
||||
.
|
||||
.SH REVISION
|
||||
.rs
|
||||
.sp
|
||||
.nf
|
||||
Last updated: 20 October 2014
|
||||
Copyright (c) 1997-2014 University of Cambridge.
|
||||
.fi
|
268
doc/pcre2posix.3
Normal file
@ -0,0 +1,268 @@
|
||||
.TH PCRE2POSIX 3 "20 October 2014" "PCRE2 10.00"
|
||||
.SH NAME
|
||||
PCRE2 - Perl-compatible regular expressions (revised API)
|
||||
.SH "SYNOPSIS"
|
||||
.rs
|
||||
.sp
|
||||
.B #include <pcre2posix.h>
|
||||
.PP
|
||||
.nf
|
||||
.B int regcomp(regex_t *\fIpreg\fP, const char *\fIpattern\fP,
|
||||
.B " int \fIcflags\fP);"
|
||||
.sp
|
||||
.B int regexec(const regex_t *\fIpreg\fP, const char *\fIstring\fP,
|
||||
.B " size_t \fInmatch\fP, regmatch_t \fIpmatch\fP[], int \fIeflags\fP);"
|
||||
.sp
|
||||
.B "size_t regerror(int \fIerrcode\fP, const regex_t *\fIpreg\fP,"
|
||||
.B " char *\fIerrbuf\fP, size_t \fIerrbuf_size\fP);"
|
||||
.sp
|
||||
.B void regfree(regex_t *\fIpreg\fP);
|
||||
.fi
|
||||
.
|
||||
.SH DESCRIPTION
|
||||
.rs
|
||||
.sp
|
||||
This set of functions provides a POSIX-style API for the PCRE2 regular
|
||||
expression 8-bit library. See the
|
||||
.\" HREF
|
||||
\fBpcre2api\fP
|
||||
.\"
|
||||
documentation for a description of PCRE2's native API, which contains much
|
||||
additional functionality. There is no POSIX-style wrapper for PCRE2's 16-bit
|
||||
and 32-bit libraries.
|
||||
.P
|
||||
The functions described here are just wrapper functions that ultimately call
|
||||
the PCRE2 native API. Their prototypes are defined in the \fBpcre2posix.h\fP
|
||||
header file, and on Unix systems the library itself is called
|
||||
\fBlibpcre2-posix.a\fP, so can be accessed by adding \fB-lpcre2-posix\fP to the
|
||||
command for linking an application that uses them. Because the POSIX functions
|
||||
call the native ones, it is also necessary to add \fB-lpcre2-8\fP.
|
||||
.P
|
||||
Those POSIX option bits that can reasonably be mapped to PCRE2 native options
|
||||
have been implemented. In addition, the option REG_EXTENDED is defined with the
|
||||
value zero. This has no effect, but since programs that are written to the
|
||||
POSIX interface often use it, this makes it easier to slot in PCRE2 as a
|
||||
replacement library. Other POSIX options are not even defined.
|
||||
.P
|
||||
There are also some other options that are not defined by POSIX. These have
|
||||
been added at the request of users who want to make use of certain
|
||||
PCRE2-specific features via the POSIX calling interface.
|
||||
.P
|
||||
When PCRE2 is called via these functions, it is only the API that is POSIX-like
|
||||
in style. The syntax and semantics of the regular expressions themselves are
|
||||
still those of Perl, subject to the setting of various PCRE2 options, as
|
||||
described below. "POSIX-like in style" means that the API approximates to the
|
||||
POSIX definition; it is not fully POSIX-compatible, and in multi-unit encoding
|
||||
domains it is probably even less compatible.
|
||||
.P
|
||||
The header for these functions is supplied as \fBpcre2posix.h\fP to avoid any
|
||||
potential clash with other POSIX libraries. It can, of course, be renamed or
|
||||
aliased as \fBregex.h\fP, which is the "correct" name. It provides two
|
||||
structure types, \fIregex_t\fP for compiled internal forms, and
|
||||
\fIregmatch_t\fP for returning captured substrings. It also defines some
|
||||
constants whose names start with "REG_"; these are used for setting options and
|
||||
identifying error codes.
|
||||
.
|
||||
.
|
||||
.SH "COMPILING A PATTERN"
|
||||
.rs
|
||||
.sp
|
||||
The function \fBregcomp()\fP is called to compile a pattern into an
|
||||
internal form. The pattern is a C string terminated by a binary zero, and
|
||||
is passed in the argument \fIpattern\fP. The \fIpreg\fP argument is a pointer
|
||||
to a \fBregex_t\fP structure that is used as a base for storing information
|
||||
about the compiled regular expression.
|
||||
.P
|
||||
The argument \fIcflags\fP is either zero, or contains one or more of the bits
|
||||
defined by the following macros:
|
||||
.sp
|
||||
REG_DOTALL
|
||||
.sp
|
||||
The PCRE2_DOTALL option is set when the regular expression is passed for
|
||||
compilation to the native function. Note that REG_DOTALL is not part of the
|
||||
POSIX standard.
|
||||
.sp
|
||||
REG_ICASE
|
||||
.sp
|
||||
The PCRE2_CASELESS option is set when the regular expression is passed for
|
||||
compilation to the native function.
|
||||
.sp
|
||||
REG_NEWLINE
|
||||
.sp
|
||||
The PCRE2_MULTILINE option is set when the regular expression is passed for
|
||||
compilation to the native function. Note that this does \fInot\fP mimic the
|
||||
defined POSIX behaviour for REG_NEWLINE (see the following section).
|
||||
.sp
|
||||
REG_NOSUB
|
||||
.sp
|
||||
The PCRE2_NO_AUTO_CAPTURE option is set when the regular expression is passed
|
||||
for compilation to the native function. In addition, when a pattern that is
|
||||
compiled with this flag is passed to \fBregexec()\fP for matching, the
|
||||
\fInmatch\fP and \fIpmatch\fP arguments are ignored, and no captured strings
|
||||
are returned.
|
||||
.sp
|
||||
REG_UCP
|
||||
.sp
|
||||
The PCRE2_UCP option is set when the regular expression is passed for
|
||||
compilation to the native function. This causes PCRE2 to use Unicode properties
|
||||
when matching \ed, \ew, etc., instead of just recognizing ASCII values. Note
|
||||
that REG_UCP is not part of the POSIX standard.
|
||||
.sp
|
||||
REG_UNGREEDY
|
||||
.sp
|
||||
The PCRE2_UNGREEDY option is set when the regular expression is passed for
|
||||
compilation to the native function. Note that REG_UNGREEDY is not part of the
|
||||
POSIX standard.
|
||||
.sp
|
||||
REG_UTF
|
||||
.sp
|
||||
The PCRE2_UTF option is set when the regular expression is passed for
|
||||
compilation to the native function. This causes the pattern itself and all data
|
||||
strings used for matching it to be treated as UTF-8 strings. Note that REG_UTF
|
||||
is not part of the POSIX standard.
|
||||
.P
|
||||
In the absence of these flags, no options are passed to the native function.
|
||||
This means that the regex is compiled with PCRE2 default semantics. In
|
||||
particular, the way it handles newline characters in the subject string is the
|
||||
Perl way, not the POSIX way. Note that setting PCRE2_MULTILINE has only
|
||||
\fIsome\fP of the effects specified for REG_NEWLINE. It does not affect the way
|
||||
newlines are matched by the dot metacharacter (they are not) or by a negative
|
||||
class such as [^a] (they are).
|
||||
.P
|
||||
The yield of \fBregcomp()\fP is zero on success, and non-zero otherwise. The
|
||||
\fIpreg\fP structure is filled in on success, and one member of the structure
|
||||
is public: \fIre_nsub\fP contains the number of capturing subpatterns in
|
||||
the regular expression. Various error codes are defined in the header file.
|
||||
.P
|
||||
NOTE: If the yield of \fBregcomp()\fP is non-zero, you must not attempt to
|
||||
use the contents of the \fIpreg\fP structure. If, for example, you pass it to
|
||||
\fBregexec()\fP, the result is undefined and your program is likely to crash.
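.P
As an illustration of the rules above, here is a minimal C sketch that compiles
a pattern, reads \fIre_nsub\fP, and never touches the structure after a failed
compilation. It is an assumption-laden example, not part of the PCRE2
distribution: it includes the system <regex.h> so that it builds without PCRE2
installed, on the assumption that substituting <pcre2posix.h> and linking with
-lpcre2-posix -lpcre2-8 leaves the calls unchanged. The name count_captures is
hypothetical.

```c
#include <regex.h>   /* assumption: replace with <pcre2posix.h> for PCRE2 */

/* Return the number of capturing subpatterns in pattern, or -1 if it does
   not compile. REG_EXTENDED is zero in pcre2posix.h, but the system library
   needs it for parenthesized groups. */
long count_captures(const char *pattern, int cflags)
{
    regex_t re;
    long n;

    if (regcomp(&re, pattern, cflags | REG_EXTENDED) != 0)
        return -1;              /* re is not valid here: do not use it */

    n = (long)re.re_nsub;       /* the one public member of regex_t */
    regfree(&re);               /* release memory tied to the structure */
    return n;
}
```

For example, count_captures("(a)(b(c))", 0) should yield 3, and a pattern with
an unclosed parenthesis should yield -1.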
|
||||
.
|
||||
.
|
||||
.SH "MATCHING NEWLINE CHARACTERS"
|
||||
.rs
|
||||
.sp
|
||||
This area is not simple, because POSIX and Perl take different views of things.
|
||||
It is not possible to get PCRE2 to obey POSIX semantics, but then PCRE2 was
|
||||
never intended to be a POSIX engine. The following table lists the different
|
||||
possibilities for matching newline characters in PCRE2:
|
||||
.sp
|
||||
                          Default   Change with
.sp
  . matches newline          no     PCRE2_DOTALL
  newline matches [^a]       yes    not changeable
  $ matches \en at end        yes    PCRE2_DOLLAR_ENDONLY
  $ matches \en in middle     no     PCRE2_MULTILINE
  ^ matches \en in middle     no     PCRE2_MULTILINE
|
||||
.sp
|
||||
This is the equivalent table for POSIX:
|
||||
.sp
|
||||
                          Default   Change with
.sp
  . matches newline          yes    REG_NEWLINE
  newline matches [^a]       yes    REG_NEWLINE
  $ matches \en at end        no     REG_NEWLINE
  $ matches \en in middle     no     REG_NEWLINE
  ^ matches \en in middle     no     REG_NEWLINE
|
||||
.sp
|
||||
PCRE2's behaviour is the same as Perl's, except that there is no equivalent for
|
||||
PCRE2_DOLLAR_ENDONLY in Perl. In both PCRE2 and Perl, there is no way to stop
|
||||
newline from matching [^a].
|
||||
.P
|
||||
The default POSIX newline handling can be obtained by setting PCRE2_DOTALL and
|
||||
PCRE2_DOLLAR_ENDONLY, but there is no way to make PCRE2 behave exactly as for
|
||||
the REG_NEWLINE action.
|
||||
.
|
||||
.
|
||||
.SH "MATCHING A PATTERN"
|
||||
.rs
|
||||
.sp
|
||||
The function \fBregexec()\fP is called to match a compiled pattern \fIpreg\fP
|
||||
against a given \fIstring\fP, which is by default terminated by a zero byte
|
||||
(but see REG_STARTEND below), subject to the options in \fIeflags\fP. These can
|
||||
be:
|
||||
.sp
|
||||
REG_NOTBOL
|
||||
.sp
|
||||
The PCRE2_NOTBOL option is set when calling the underlying PCRE2 matching
|
||||
function.
|
||||
.sp
|
||||
REG_NOTEMPTY
|
||||
.sp
|
||||
The PCRE2_NOTEMPTY option is set when calling the underlying PCRE2 matching
|
||||
function. Note that REG_NOTEMPTY is not part of the POSIX standard. However,
|
||||
setting this option can give more POSIX-like behaviour in some situations.
|
||||
.sp
|
||||
REG_NOTEOL
|
||||
.sp
|
||||
The PCRE2_NOTEOL option is set when calling the underlying PCRE2 matching
|
||||
function.
|
||||
.sp
|
||||
REG_STARTEND
|
||||
.sp
|
||||
The string is considered to start at \fIstring\fP + \fIpmatch[0].rm_so\fP and
|
||||
to have a terminating NUL located at \fIstring\fP + \fIpmatch[0].rm_eo\fP
|
||||
(there need not actually be a NUL at that location), regardless of the value of
|
||||
\fInmatch\fP. This is a BSD extension, compatible with but not specified by
|
||||
IEEE Standard 1003.2 (POSIX.2), and should be used with caution in software
|
||||
intended to be portable to other systems. Note that a non-zero \fIrm_so\fP does
|
||||
not imply REG_NOTBOL; REG_STARTEND affects only the location of the string, not
|
||||
how it is matched.
|
||||
.P
|
||||
If the pattern was compiled with the REG_NOSUB flag, no data about any matched
|
||||
strings is returned. The \fInmatch\fP and \fIpmatch\fP arguments of
|
||||
\fBregexec()\fP are ignored.
|
||||
.P
|
||||
If the value of \fInmatch\fP is zero, or if the value of \fIpmatch\fP is NULL,
|
||||
no data about any matched strings is returned.
|
||||
.P
|
||||
Otherwise, the portion of the string that was matched, and also any captured
|
||||
substrings, are returned via the \fIpmatch\fP argument, which points to an
|
||||
array of \fInmatch\fP structures of type \fIregmatch_t\fP, containing the
|
||||
members \fIrm_so\fP and \fIrm_eo\fP. These contain the byte offset to the first
|
||||
character of each substring and the offset to the first character after the end
|
||||
of each substring, respectively. The 0th element of the vector relates to the
|
||||
entire portion of \fIstring\fP that was matched; subsequent elements relate to
|
||||
the capturing subpatterns of the regular expression. Unused entries in the
|
||||
array have both structure members set to -1.
|
||||
.P
|
||||
A successful match yields a zero return; various error codes are defined in the
|
||||
header file, of which REG_NOMATCH is the "expected" failure code.
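.P
The way offsets come back through \fIpmatch\fP can be sketched as follows. This
is a hedged example rather than code from the distribution: it uses the system
<regex.h> so that it builds anywhere, assuming that <pcre2posix.h> can be
substituted without changing the calls, and the name find_first is
hypothetical.

```c
#include <regex.h>   /* assumption: replace with <pcre2posix.h> for PCRE2 */
#include <stddef.h>

/* Match pattern against subject and store the byte offsets of the whole
   match in *start and *end. Returns 0 on a match, REG_NOMATCH when there is
   none, or another non-zero code if compilation fails. */
int find_first(const char *pattern, const char *subject,
               size_t *start, size_t *end)
{
    regex_t re;
    regmatch_t pmatch[1];       /* element 0 describes the whole match */
    int rc = regcomp(&re, pattern, REG_EXTENDED);

    if (rc != 0)
        return rc;              /* re must not be used after a failure */

    rc = regexec(&re, subject, 1, pmatch, 0);
    if (rc == 0) {
        *start = (size_t)pmatch[0].rm_so;   /* first matched byte */
        *end = (size_t)pmatch[0].rm_eo;     /* first byte after the match */
    }
    regfree(&re);
    return rc;
}
```

Applied to the pattern "cat|dog" and the subject "the cat sat on the mat", the
offsets should describe the word "cat".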
|
||||
.
|
||||
.
|
||||
.SH "ERROR MESSAGES"
|
||||
.rs
|
||||
.sp
|
||||
The \fBregerror()\fP function maps a non-zero error code from either
|
||||
\fBregcomp()\fP or \fBregexec()\fP to a printable message. If \fIpreg\fP is not
|
||||
NULL, the error should have arisen from the use of that structure. A message
|
||||
terminated by a binary zero is placed in \fIerrbuf\fP. The length of the
|
||||
message, including the zero, is limited to \fIerrbuf_size\fP. The yield of the
|
||||
function is the size of buffer needed to hold the whole message.
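.P
Because the yield is the size needed for the whole message, a caller can probe
with a small buffer and then allocate exactly enough. The sketch below shows
one way to do this. It is an illustration under assumptions: the system
<regex.h> stands in for <pcre2posix.h>, and the name full_error_message is
hypothetical.

```c
#include <regex.h>   /* assumption: replace with <pcre2posix.h> for PCRE2 */
#include <stdlib.h>

/* Return a malloc()ed copy of the complete message for errcode, or NULL if
   memory cannot be obtained. The caller must free() the result. */
char *full_error_message(int errcode, const regex_t *preg)
{
    char probe[16];             /* deliberately small; may be truncated */
    size_t needed = regerror(errcode, preg, probe, sizeof probe);
    char *buf;

    if (needed == 0)
        return NULL;
    buf = malloc(needed);       /* needed includes the terminating zero */
    if (buf != NULL)
        regerror(errcode, preg, buf, needed);
    return buf;
}
```

The first call is made with a real buffer rather than a zero size, since the
text above does not say how a zero \fIerrbuf_size\fP is handled.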
|
||||
.
|
||||
.
|
||||
.SH MEMORY USAGE
|
||||
.rs
|
||||
.sp
|
||||
Compiling a regular expression causes memory to be allocated and associated
|
||||
with the \fIpreg\fP structure. The function \fBregfree()\fP frees all such
|
||||
memory, after which \fIpreg\fP may no longer be used as a compiled expression.
|
||||
.
|
||||
.
|
||||
.SH AUTHOR
|
||||
.rs
|
||||
.sp
|
||||
.nf
|
||||
Philip Hazel
|
||||
University Computing Service
|
||||
Cambridge CB2 3QH, England.
|
||||
.fi
|
||||
.
|
||||
.
|
||||
.SH REVISION
|
||||
.rs
|
||||
.sp
|
||||
.nf
|
||||
Last updated: 20 October 2014
|
||||
Copyright (c) 1997-2014 University of Cambridge.
|
||||
.fi
|
94
doc/pcre2sample.3
Normal file
@ -0,0 +1,94 @@
|
||||
.TH PCRE2SAMPLE 3 "20 October 2014" "PCRE2 10.00"
|
||||
.SH NAME
|
||||
PCRE2 - Perl-compatible regular expressions (revised API)
|
||||
.SH "PCRE2 SAMPLE PROGRAM"
|
||||
.rs
|
||||
.sp
|
||||
A simple, complete demonstration program to get you started with using PCRE2 is
|
||||
supplied in the file \fIpcre2demo.c\fP in the \fBsrc\fP directory in the PCRE2
|
||||
distribution. A listing of this program is given in the
|
||||
.\" HREF
|
||||
\fBpcre2demo\fP
|
||||
.\"
|
||||
documentation. If you do not have a copy of the PCRE2 distribution, you can
|
||||
save this listing to re-create the contents of \fIpcre2demo.c\fP.
|
||||
.P
|
||||
The demonstration program, which uses the PCRE2 8-bit library, compiles the
|
||||
regular expression that is its first argument, and matches it against the
|
||||
subject string in its second argument. No PCRE2 options are set, and default
|
||||
character tables are used. If matching succeeds, the program outputs the
|
||||
portion of the subject that matched, together with the contents of any captured
|
||||
substrings.
|
||||
.P
|
||||
If the -g option is given on the command line, the program then goes on to
|
||||
check for further matches of the same regular expression in the same subject
|
||||
string. The logic is a little bit tricky because of the possibility of matching
|
||||
an empty string. Comments in the code explain what is going on.
|
||||
.P
|
||||
If PCRE2 is installed in the standard include and library directories for your
|
||||
operating system, you should be able to compile the demonstration program using
|
||||
this command:
|
||||
.sp
|
||||
gcc -o pcre2demo pcre2demo.c -lpcre2-8
|
||||
.sp
|
||||
If PCRE2 is installed elsewhere, you may need to add additional options to the
|
||||
command line. For example, on a Unix-like system that has PCRE2 installed in
|
||||
\fI/usr/local\fP, you can compile the demonstration program using a command
|
||||
like this:
|
||||
.sp
|
||||
.\" JOINSH
|
||||
gcc -o pcre2demo -I/usr/local/include pcre2demo.c \e
|
||||
-L/usr/local/lib -lpcre2-8
|
||||
.sp
|
||||
.P
|
||||
Once you have compiled and linked the demonstration program, you can run simple
|
||||
tests like this:
|
||||
.sp
|
||||
./pcre2demo 'cat|dog' 'the cat sat on the mat'
|
||||
./pcre2demo -g 'cat|dog' 'the dog sat on the cat'
|
||||
.sp
|
||||
Note that there is a much more comprehensive test program, called
|
||||
.\" HREF
|
||||
\fBpcre2test\fP,
|
||||
.\"
|
||||
which supports many more facilities for testing regular expressions using the
|
||||
PCRE2 libraries. The
|
||||
.\" HREF
|
||||
\fBpcre2demo\fP
|
||||
.\"
|
||||
program is provided as a simple coding example.
|
||||
.P
|
||||
If you try to run
|
||||
.\" HREF
|
||||
\fBpcre2demo\fP
|
||||
.\"
|
||||
when PCRE2 is not installed in the standard library directory, you may get an
|
||||
error like this on some operating systems (e.g. Solaris):
|
||||
.sp
|
||||
ld.so.1: a.out: fatal: libpcre2.so.0: open failed: No such file or directory
|
||||
.sp
|
||||
This is caused by the way shared library support works on those systems. You
|
||||
need to add
|
||||
.sp
|
||||
-R/usr/local/lib
|
||||
.sp
|
||||
(for example) to the compile command to get round this problem.
|
||||
.
|
||||
.
|
||||
.SH AUTHOR
|
||||
.rs
|
||||
.sp
|
||||
.nf
|
||||
Philip Hazel
|
||||
University Computing Service
|
||||
Cambridge CB2 3QH, England.
|
||||
.fi
|
||||
.
|
||||
.
|
||||
.SH REVISION
|
||||
.rs
|
||||
.sp
|
||||
.nf
|
||||
Last updated: 20 October 2014
|
||||
Copyright (c) 1997-2014 University of Cambridge.
|
||||
.fi
|
199
doc/pcre2stack.3
Normal file
@ -0,0 +1,199 @@
|
||||
.TH PCRE2STACK 3 "20 October 2014" "PCRE2 10.00"
|
||||
.SH NAME
|
||||
PCRE2 - Perl-compatible regular expressions (revised API)
|
||||
.SH "PCRE2 DISCUSSION OF STACK USAGE"
|
||||
.rs
|
||||
.sp
|
||||
When you call \fBpcre2_match()\fP, it makes use of an internal function called
|
||||
\fBmatch()\fP. This calls itself recursively at branch points in the pattern,
|
||||
in order to remember the state of the match so that it can back up and try a
|
||||
different alternative after a failure. As matching proceeds deeper and deeper
|
||||
into the tree of possibilities, the recursion depth increases. The
|
||||
\fBmatch()\fP function is also called in other circumstances, for example,
|
||||
whenever a parenthesized sub-pattern is entered, and in certain cases of
|
||||
repetition.
|
||||
.P
|
||||
Not all calls of \fBmatch()\fP increase the recursion depth; for an item such
|
||||
as a* it may be called several times at the same level, after matching
|
||||
different numbers of a's. Furthermore, in a number of cases where the result of
|
||||
the recursive call would immediately be passed back as the result of the
|
||||
current call (a "tail recursion"), the function is just restarted instead.
|
||||
.P
|
||||
The above comments apply when \fBpcre2_match()\fP is run in its normal
|
||||
interpretive manner. If the compiled pattern was processed by
|
||||
\fBpcre2_jit_compile()\fP, and just-in-time compiling was successful, and the
|
||||
options passed to \fBpcre2_match()\fP were not incompatible, the matching
|
||||
process uses the JIT-compiled code instead of the \fBmatch()\fP function. In
|
||||
this case, the memory requirements are handled entirely differently. See the
|
||||
.\" HREF
|
||||
\fBpcre2jit\fP
|
||||
.\"
|
||||
documentation for details.
|
||||
.P
|
||||
The \fBpcre2_dfa_match()\fP function operates in a different way to
|
||||
\fBpcre2_match()\fP, and uses recursion only when there is a regular expression
|
||||
recursion or subroutine call in the pattern. This includes the processing of
|
||||
assertion and "once-only" subpatterns, which are handled like subroutine calls.
|
||||
Normally, these are never very deep, and the limit on the complexity of
|
||||
\fBpcre2_dfa_match()\fP is controlled by the amount of workspace it is given.
|
||||
However, it is possible to write patterns with runaway infinite recursions;
|
||||
such patterns will cause \fBpcre2_dfa_match()\fP to run out of stack. At
|
||||
present, there is no protection against this.
|
||||
.P
|
||||
The comments that follow do NOT apply to \fBpcre2_dfa_match()\fP; they are
|
||||
relevant only for \fBpcre2_match()\fP without the JIT optimization.
|
||||
.
|
||||
.
|
||||
.SS "Reducing \fBpcre2_match()\fP's stack usage"
|
||||
.rs
|
||||
.sp
|
||||
Each time that the internal \fBmatch()\fP function is called recursively, it
|
||||
uses memory from the process stack. For certain kinds of pattern and data, very
|
||||
large amounts of stack may be needed, despite the recognition of "tail
|
||||
recursion". You can often reduce the amount of recursion, and therefore the
|
||||
amount of stack used, by modifying the pattern that is being matched. Consider,
|
||||
for example, this pattern:
|
||||
.sp
|
||||
([^<]|<(?!inet))+
|
||||
.sp
|
||||
It matches from wherever it starts until it encounters "<inet" or the end of
|
||||
the data, and is the kind of pattern that might be used when processing an XML
|
||||
file. Each iteration of the outer parentheses matches either one character that
|
||||
is not "<" or a "<" that is not followed by "inet". However, each time a
|
||||
parenthesis is processed, a recursion occurs, so this formulation uses a stack
|
||||
frame for each matched character. For a long string, a lot of stack is
|
||||
required. Consider now this rewritten pattern, which matches exactly the same
|
||||
strings:
|
||||
.sp
|
||||
([^<]++|<(?!inet))+
|
||||
.sp
|
||||
This uses very much less stack, because runs of characters that do not contain
|
||||
"<" are "swallowed" in one item inside the parentheses. Recursion happens only
|
||||
when a "<" character that is not followed by "inet" is encountered (and we
|
||||
assume this is relatively rare). A possessive quantifier is used to stop any
|
||||
backtracking into the runs of non-"<" characters, but that is not related to
|
||||
stack usage.
|
||||
.P
|
||||
This example shows that one way of avoiding stack problems when matching long
|
||||
subject strings is to write repeated parenthesized subpatterns to match more
|
||||
than one character whenever possible.
|
||||
.
|
||||
.
|
||||
.SS "Compiling PCRE2 to use heap instead of stack for \fBpcre2_match()\fP"
|
||||
.rs
|
||||
.sp
|
||||
In environments where stack memory is constrained, you might want to compile
|
||||
PCRE2 to use heap memory instead of stack for remembering back-up points when
|
||||
\fBpcre2_match()\fP is running. This makes it run more slowly, however. Details
|
||||
of how to do this are given in the
|
||||
.\" HREF
|
||||
\fBpcre2build\fP
|
||||
.\"
|
||||
documentation. When built in this way, instead of using the stack, PCRE2
|
||||
gets memory for remembering backup points from the heap. By default, the memory
|
||||
is obtained by calling the system \fBmalloc()\fP function, but you can arrange
|
||||
to supply your own memory management function. For details, see the section
|
||||
entitled
|
||||
.\" HTML <a href="pcre2api.html#matchcontext">
|
||||
.\" </a>
|
||||
"The match context"
|
||||
.\"
|
||||
in the
|
||||
.\" HREF
|
||||
\fBpcre2api\fP
|
||||
.\"
|
||||
documentation. Since the block sizes are always the same, it may be possible to
|
||||
implement a customized memory handler that is more efficient than the standard
|
||||
function. The memory blocks obtained for this purpose are retained and re-used
|
||||
if possible while \fBpcre2_match()\fP is running. They are all freed just
|
||||
before it exits.
|
||||
.
|
||||
.
|
||||
.SS "Limiting \fBpcre2_match()\fP's stack usage"
|
||||
.rs
|
||||
.sp
|
||||
You can set limits on the number of times the internal \fBmatch()\fP function
|
||||
is called, both in total and recursively. If a limit is exceeded,
|
||||
\fBpcre2_match()\fP returns an error code. Setting suitable limits should
|
||||
prevent it from running out of stack. The default values of the limits are very
|
||||
large, and unlikely ever to operate. They can be changed when PCRE2 is built,
|
||||
and they can also be set when \fBpcre2_match()\fP is called. For details of
|
||||
these interfaces, see the
|
||||
.\" HREF
|
||||
\fBpcre2build\fP
|
||||
.\"
|
||||
documentation and the section entitled
|
||||
.\" HTML <a href="pcre2api.html#matchcontext">
|
||||
.\" </a>
|
||||
"The match context"
|
||||
.\"
|
||||
in the
|
||||
.\" HREF
|
||||
\fBpcre2api\fP
|
||||
.\"
|
||||
documentation.
|
||||
.P
|
||||
As a very rough rule of thumb, you should reckon on about 500 bytes per
|
||||
recursion. Thus, if you want to limit your stack usage to 8Mb, you should set
|
||||
the limit at 16000 recursions. A 64Mb stack, on the other hand, can support
|
||||
around 128000 recursions.
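.P
The arithmetic behind these figures is a simple division, sketched below in C.
The 500-byte figure is the rule of thumb given above, not a measured value, and
the names BYTES_PER_RECURSION and recursion_limit_for_stack are hypothetical.

```c
/* Rough stack cost of one recursive call of match(), from the rule of
   thumb above. Measure on your own system before relying on it. */
#define BYTES_PER_RECURSION 500UL

/* Largest recursion limit that should fit in stack_bytes of stack. */
unsigned long recursion_limit_for_stack(unsigned long stack_bytes)
{
    return stack_bytes / BYTES_PER_RECURSION;
}
```

For an 8Mb stack (8*1024*1024 bytes) this gives 16777, and for 64Mb it gives
134217, which the text rounds down to 16000 and 128000 to leave some margin.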
|
||||
.P
|
||||
The \fBpcre2test\fP test program has a modifier called "find_limits" which, if
|
||||
applied to a subject line, causes it to find the smallest limits that allow a
|
||||
pattern to match. This is done by calling \fBpcre2_match()\fP repeatedly with
|
||||
different limits.
|
||||
.
|
||||
.
|
||||
.SS "Changing stack size in Unix-like systems"
|
||||
.rs
|
||||
.sp
|
||||
In Unix-like environments, there is not often a problem with the stack unless
|
||||
very long strings are involved, though the default limit on stack size varies
|
||||
from system to system. Values from 8Mb to 64Mb are common. You can find your
|
||||
default limit by running the command:
|
||||
.sp
|
||||
ulimit -s
|
||||
.sp
|
||||
Unfortunately, the effect of running out of stack is often SIGSEGV, though
|
||||
sometimes a more explicit error message is given. You can normally increase the
|
||||
limit on stack size by code such as this:
|
||||
.sp
|
||||
struct rlimit rlim;
|
||||
getrlimit(RLIMIT_STACK, &rlim);
|
||||
rlim.rlim_cur = 100*1024*1024;
|
||||
setrlimit(RLIMIT_STACK, &rlim);
|
||||
.sp
|
||||
This reads the current limits (soft and hard) using \fBgetrlimit()\fP, then
|
||||
attempts to increase the soft limit to 100Mb using \fBsetrlimit()\fP. You must
|
||||
do this before calling \fBpcre2_match()\fP.
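.P
The fragment above omits error checking. A slightly more defensive version
might look like this sketch, which assumes a POSIX system with
<sys/resource.h>. The name raise_stack_limit is hypothetical, and clamping to
the hard limit is a design choice of the example rather than a requirement.

```c
#include <sys/resource.h>

/* Try to make the soft stack limit at least wanted bytes without exceeding
   the hard limit. Returns 0 if the limit was already large enough or was set
   successfully (possibly clamped to the hard limit), and -1 on failure. */
int raise_stack_limit(rlim_t wanted)
{
    struct rlimit rlim;

    if (getrlimit(RLIMIT_STACK, &rlim) != 0)
        return -1;

    /* Nothing to do if the soft limit is already large enough. */
    if (rlim.rlim_cur == RLIM_INFINITY || rlim.rlim_cur >= wanted)
        return 0;

    /* An unprivileged process cannot raise the soft limit past the hard one. */
    if (rlim.rlim_max != RLIM_INFINITY && wanted > rlim.rlim_max)
        wanted = rlim.rlim_max;

    rlim.rlim_cur = wanted;
    return setrlimit(RLIMIT_STACK, &rlim);
}
```

A call such as raise_stack_limit((rlim_t)100 * 1024 * 1024) corresponds to the
100Mb request shown above. Because the function never lowers an existing limit,
it is safe to call more than once.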
|
||||
.
|
||||
.
|
||||
.SS "Changing stack size in Mac OS X"
|
||||
.rs
|
||||
.sp
|
||||
Using \fBsetrlimit()\fP, as described above, should also work on Mac OS X. It
|
||||
is also possible to set a stack size when linking a program. There is a
|
||||
discussion about stack sizes in Mac OS X at this web site:
|
||||
.\" HTML <a href="http://developer.apple.com/qa/qa2005/qa1419.html">
|
||||
.\" </a>
|
||||
http://developer.apple.com/qa/qa2005/qa1419.html.
|
||||
.\"
|
||||
.
|
||||
.
|
||||
.SH AUTHOR
|
||||
.rs
|
||||
.sp
|
||||
.nf
|
||||
Philip Hazel
|
||||
University Computing Service
|
||||
Cambridge CB2 3QH, England.
|
||||
.fi
|
||||
.
|
||||
.
|
||||
.SH REVISION
|
||||
.rs
|
||||
.sp
|
||||
.nf
|
||||
Last updated: 20 October 2014
|
||||
Copyright (c) 1997-2014 University of Cambridge.
|
||||
.fi
|
540
doc/pcre2syntax.3
Normal file
@ -0,0 +1,540 @@
|
||||
.TH PCRE2SYNTAX 3 "20 October 2014" "PCRE2 10.00"
|
||||
.SH NAME
|
||||
PCRE2 - Perl-compatible regular expressions (revised API)
|
||||
.SH "PCRE2 REGULAR EXPRESSION SYNTAX SUMMARY"
|
||||
.rs
|
||||
.sp
|
||||
The full syntax and semantics of the regular expressions that are supported by
|
||||
PCRE2 are described in the
|
||||
.\" HREF
|
||||
\fBpcre2pattern\fP
|
||||
.\"
|
||||
documentation. This document contains a quick-reference summary of the syntax.
|
||||
.
|
||||
.
|
||||
.SH "QUOTING"
|
||||
.rs
|
||||
.sp
|
||||
  \ex         where x is non-alphanumeric is a literal x
  \eQ...\eE    treat enclosed characters as literal
|
||||
.
|
||||
.
|
||||
.SH "CHARACTERS"
|
||||
.rs
|
||||
.sp
|
||||
\ea alarm, that is, the BEL character (hex 07)
|
||||
\ecx "control-x", where x is any ASCII character
|
||||
\ee escape (hex 1B)
|
||||
\ef form feed (hex 0C)
|
||||
\en newline (hex 0A)
|
||||
\er carriage return (hex 0D)
|
||||
\et tab (hex 09)
|
||||
\e0dd character with octal code 0dd
|
||||
\eddd character with octal code ddd, or backreference
|
||||
\eo{ddd..} character with octal code ddd..
|
||||
\exhh character with hex code hh
|
||||
\ex{hhh..} character with hex code hhh..
|
||||
.sp
|
||||
Note that \e0dd is always an octal code, and that \e8 and \e9 are the literal
|
||||
characters "8" and "9".
|
||||
.
|
||||
.
|
||||
.SH "CHARACTER TYPES"
|
||||
.rs
|
||||
.sp
|
||||
. any character except newline;
|
||||
in dotall mode, any character whatsoever
|
||||
\eC one data unit, even in UTF mode (best avoided)
|
||||
\ed a decimal digit
|
||||
\eD a character that is not a decimal digit
|
||||
\eh a horizontal white space character
|
||||
\eH a character that is not a horizontal white space character
|
||||
\eN a character that is not a newline
|
||||
\ep{\fIxx\fP} a character with the \fIxx\fP property
|
||||
\eP{\fIxx\fP} a character without the \fIxx\fP property
|
||||
\eR a newline sequence
|
||||
\es a white space character
|
||||
\eS a character that is not a white space character
|
||||
\ev a vertical white space character
|
||||
\eV a character that is not a vertical white space character
|
||||
\ew a "word" character
|
||||
\eW a "non-word" character
|
||||
\eX a Unicode extended grapheme cluster
|
||||
.sp
|
||||
By default, \ed, \es, and \ew match only ASCII characters, even in UTF-8 mode
|
||||
or in the 16-bit and 32-bit libraries. However, if locale-specific matching is
|
||||
happening, \es and \ew may also match characters with code points in the range
|
||||
128-255. If the PCRE2_UCP option is set, the behaviour of these escape
|
||||
sequences is changed to use Unicode properties and they match many more
|
||||
characters.
|
||||
.
|
||||
.
|
||||
.SH "GENERAL CATEGORY PROPERTIES FOR \ep and \eP"
|
||||
.rs
|
||||
.sp
|
||||
C Other
|
||||
Cc Control
|
||||
Cf Format
|
||||
Cn Unassigned
|
||||
Co Private use
|
||||
Cs Surrogate
|
||||
.sp
|
||||
L Letter
|
||||
Ll Lower case letter
|
||||
Lm Modifier letter
|
||||
Lo Other letter
|
||||
Lt Title case letter
|
||||
Lu Upper case letter
|
||||
L& Ll, Lu, or Lt
|
||||
.sp
|
||||
M Mark
|
||||
Mc Spacing mark
|
||||
Me Enclosing mark
|
||||
Mn Non-spacing mark
|
||||
.sp
|
||||
N Number
|
||||
Nd Decimal number
|
||||
Nl Letter number
|
||||
No Other number
|
||||
.sp
|
||||
P Punctuation
|
||||
Pc Connector punctuation
|
||||
Pd Dash punctuation
|
||||
Pe Close punctuation
|
||||
Pf Final punctuation
|
||||
Pi Initial punctuation
|
||||
Po Other punctuation
|
||||
Ps Open punctuation
|
||||
.sp
|
||||
S Symbol
|
||||
Sc Currency symbol
|
||||
Sk Modifier symbol
|
||||
Sm Mathematical symbol
|
||||
So Other symbol
|
||||
.sp
|
||||
Z Separator
|
||||
Zl Line separator
|
||||
Zp Paragraph separator
|
||||
Zs Space separator
|
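The two-letter codes above are the Unicode general categories. As a rough illustration, Python's unicodedata module reports the same codes for individual characters. The has_property helper is hypothetical and only mimics how a one-letter property such as L covers all of its subcategories, and how L& covers Ll, Lu, and Lt; it is not how PCRE2 implements \p.

```python
import unicodedata

# Look up the general category of a few sample characters.
samples = ["A", "a", "5", "_", "-", "(", ")", "$", "+", " "]
categories = {ch: unicodedata.category(ch) for ch in samples}

def has_property(ch, prop):
    """Hypothetical helper: does ch have general category property prop?"""
    cat = unicodedata.category(ch)
    if prop == "L&":
        # L& is shorthand for Ll, Lu, or Lt.
        return cat in ("Ll", "Lu", "Lt")
    # A one-letter property matches every category that starts with it.
    return cat == prop or (len(prop) == 1 and cat.startswith(prop))

for ch, cat in categories.items():
    print(repr(ch), cat)
```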
||||
.
|
||||
.
|
||||
.SH "PCRE2 SPECIAL CATEGORY PROPERTIES FOR \ep AND \eP"
|
||||
.rs
|
||||
.sp
|
||||
Xan Alphanumeric: union of properties L and N
|
||||
Xps POSIX space: property Z or tab, NL, VT, FF, CR
|
||||
Xsp Perl space: property Z or tab, NL, VT, FF, CR
|
||||
Xuc Universally-named character: one that can be
|
||||
represented by a Universal Character Name
|
||||
Xwd Perl word: property Xan or underscore
|
||||
.sp
|
||||
Perl and POSIX space are now the same. Perl added VT to its space character set
|
||||
at release 5.18.
|
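The definitions of Xan, Xsp (identical to Xps), and Xwd given above can be written out as small predicates. This is a minimal sketch built on Python's unicodedata module; the function names are hypothetical and the code follows only the wording of the table, not the PCRE2 source.

```python
import unicodedata

# Tab, NL, VT, FF, CR: the control characters listed for Xps and Xsp.
SPACE_CONTROLS = {"\t", "\n", "\v", "\f", "\r"}

def is_xan(ch):
    """Alphanumeric: union of properties L and N."""
    return unicodedata.category(ch)[0] in ("L", "N")

def is_xsp(ch):
    """Perl/POSIX space: property Z or tab, NL, VT, FF, CR."""
    return unicodedata.category(ch)[0] == "Z" or ch in SPACE_CONTROLS

def is_xwd(ch):
    """Perl word: property Xan or underscore."""
    return is_xan(ch) or ch == "_"

print(is_xan("a"), is_xsp("\v"), is_xwd("_"))
```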
||||
.
|
||||
.
|
||||
.SH "SCRIPT NAMES FOR \ep AND \eP"
|
||||
.rs
|
||||
.sp
|
||||
Arabic,
|
||||
Armenian,
|
||||
Avestan,
|
||||
Balinese,
|
||||
Bamum,
|
||||
Bassa_Vah,
|
||||
Batak,
|
||||
Bengali,
|
||||
Bopomofo,
|
||||
Brahmi,
|
||||
Braille,
|
||||
Buginese,
|
||||
Buhid,
|
||||
Canadian_Aboriginal,
|
||||
Carian,
|
||||
Caucasian_Albanian,
|
||||
Chakma,
|
||||
Cham,
|
||||
Cherokee,
|
||||
Common,
|
||||
Coptic,
|
||||
Cuneiform,
|
||||
Cypriot,
|
||||
Cyrillic,
|
||||
Deseret,
|
||||
Devanagari,
|
||||
Duployan,
|
||||
Egyptian_Hieroglyphs,
|
||||
Elbasan,
|
||||
Ethiopic,
|
||||
Georgian,
|
||||
Glagolitic,
|
||||
Gothic,
|
||||
Grantha,
|
||||
Greek,
|
||||
Gujarati,
|
||||
Gurmukhi,
|
||||
Han,
|
||||
Hangul,
|
||||
Hanunoo,
|
||||
Hebrew,
|
||||
Hiragana,
|
||||
Imperial_Aramaic,
|
||||
Inherited,
|
||||
Inscriptional_Pahlavi,
|
||||
Inscriptional_Parthian,
|
||||
Javanese,
|
||||
Kaithi,
|
||||
Kannada,
|
||||
Katakana,
|
||||
Kayah_Li,
|
||||
Kharoshthi,
|
||||
Khmer,
|
||||
Khojki,
|
||||
Khudawadi,
|
||||
Lao,
|
||||
Latin,
|
||||
Lepcha,
|
||||
Limbu,
|
||||
Linear_A,
|
||||
Linear_B,
|
||||
Lisu,
|
||||
Lycian,
|
||||
Lydian,
|
||||
Mahajani,
|
||||
Malayalam,
|
||||
Mandaic,
|
||||
Manichaean,
|
||||
Meetei_Mayek,
|
||||
Mende_Kikakui,
|
||||
Meroitic_Cursive,
|
||||
Meroitic_Hieroglyphs,
|
||||
Miao,
|
||||
Modi,
|
||||
Mongolian,
|
||||
Mro,
|
||||
Myanmar,
|
||||
Nabataean,
|
||||
New_Tai_Lue,
|
||||
Nko,
|
||||
Ogham,
|
||||
Ol_Chiki,
|
||||
Old_Italic,
|
||||
Old_North_Arabian,
|
||||
Old_Permic,
|
||||
Old_Persian,
|
||||
Old_South_Arabian,
|
||||
Old_Turkic,
|
||||
Oriya,
|
||||
Osmanya,
|
||||
Pahawh_Hmong,
|
||||
Palmyrene,
|
||||
Pau_Cin_Hau,
|
||||
Phags_Pa,
|
||||
Phoenician,
|
||||
Psalter_Pahlavi,
|
||||
Rejang,
|
||||
Runic,
|
||||
Samaritan,
|
||||
Saurashtra,
|
||||
Sharada,
|
||||
Shavian,
|
||||
Siddham,
|
||||
Sinhala,
|
||||
Sora_Sompeng,
|
||||
Sundanese,
|
||||
Syloti_Nagri,
|
||||
Syriac,
|
||||
Tagalog,
|
||||
Tagbanwa,
|
||||
Tai_Le,
|
||||
Tai_Tham,
|
||||
Tai_Viet,
|
||||
Takri,
|
||||
Tamil,
|
||||
Telugu,
|
||||
Thaana,
|
||||
Thai,
|
||||
Tibetan,
|
||||
Tifinagh,
|
||||
Tirhuta,
|
||||
Ugaritic,
|
||||
Vai,
|
||||
Warang_Citi,
|
||||
Yi.
|
||||
.
|
||||
.
|
||||
.SH "CHARACTER CLASSES"
|
||||
.rs
|
||||
.sp
|
||||
[...] positive character class
|
||||
[^...] negative character class
|
||||
[x-y] range (can be used for hex characters)
|
||||
[[:xxx:]] positive POSIX named set
|
||||
[[:^xxx:]] negative POSIX named set
|
||||
.sp
|
||||
alnum alphanumeric
|
||||
alpha alphabetic
|
||||
ascii 0-127
|
||||
blank space or tab
|
||||
cntrl control character
|
||||
digit decimal digit
|
||||
graph printing, excluding space
|
||||
lower lower case letter
|
||||
print printing, including space
|
||||
punct printing, excluding alphanumeric
|
||||
space white space
|
||||
upper upper case letter
|
||||
word same as \ew
|
||||
xdigit hexadecimal digit
|
||||
.sp
|
||||
In PCRE2, POSIX character set names recognize only ASCII characters by default,
|
||||
but some of them use Unicode properties if PCRE2_UCP is set. You can use
|
||||
\eQ...\eE inside a character class.
|
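Positive classes, negative classes, and ranges written with hex escapes share their syntax with Python's re module, so they can be tried there. This is only a sketch under that assumption: Python's re does not support the [[:xxx:]] POSIX names or \Q...\E, so those are left out.

```python
import re

# Positive class: any one of the listed characters.
vowels = re.findall(r"[aeiou]", "regex")

# Negative class: any character not listed.
consonants = re.findall(r"[^aeiou]", "regex")

# Range whose end points are given as hex escapes (A to F).
hex_letters = re.findall(r"[\x41-\x46]", "CAFE42")

print(vowels, consonants, hex_letters)
```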
||||
.
|
||||
.
|
||||
.SH "QUANTIFIERS"
|
||||
.rs
|
||||
.sp
|
||||
? 0 or 1, greedy
|
||||
?+ 0 or 1, possessive
|
||||
?? 0 or 1, lazy
|
||||
* 0 or more, greedy
|
||||
*+ 0 or more, possessive
|
||||
*? 0 or more, lazy
|
||||
+ 1 or more, greedy
|
||||
++ 1 or more, possessive
|
||||
+? 1 or more, lazy
|
||||
{n} exactly n
|
||||
{n,m} at least n, no more than m, greedy
|
||||
{n,m}+ at least n, no more than m, possessive
|
||||
{n,m}? at least n, no more than m, lazy
|
||||
{n,} n or more, greedy
|
||||
{n,}+ n or more, possessive
|
||||
{n,}? n or more, lazy
|
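The difference between greedy, lazy, and possessive quantifiers shows up when the rest of the pattern needs some of the repeated text back. The sketch below uses Python's re module as a stand-in for PCRE2; it assumes the two engines agree on these constructs. Python only accepts possessive quantifiers from version 3.11, so that case is guarded.

```python
import re
import sys

subject = "<a><b>"

# Greedy: take as much as possible, then give back until the match succeeds.
greedy = re.search(r"<.+>", subject).group()

# Lazy: take as little as possible, then extend until the match succeeds.
lazy = re.search(r"<.+?>", subject).group()

# Bounded repeats follow the same rule.
bounded_greedy = re.search(r"\d{2,3}", "12345").group()
bounded_lazy = re.search(r"\d{2,3}?", "12345").group()

# Possessive: take as much as possible and never give anything back.
# Here .++ swallows the final ">" so the pattern cannot match at all.
# On Python versions before 3.11 the search is skipped and the value is None.
possessive = re.search(r"<.++>", subject) if sys.version_info >= (3, 11) else None

print(greedy, lazy, bounded_greedy, bounded_lazy, possessive)
```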
||||
.
|
||||
.
|
||||
.SH "ANCHORS AND SIMPLE ASSERTIONS"
|
||||
.rs
|
||||
.sp
|
||||
\eb word boundary
|
||||
\eB not a word boundary
|
||||
^ start of subject
|
||||
also after internal newline in multiline mode
|
||||
\eA start of subject
|
||||
$ end of subject
|
||||
also before newline at end of subject
|
||||
also before internal newline in multiline mode
|
||||
\eZ end of subject
|
||||
also before newline at end of subject
|
||||
\ez end of subject
|
||||
\eG first matching position in subject
|
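The effect of multiline mode on ^ and $, and the way $ also matches before a newline at the end of the subject, can be tried in Python's re module, which is assumed here to agree with PCRE2 for ^, $, \A, and \b. One known difference: Python's \Z matches only at the very end of the string, which corresponds to \z in the table above, so neither is used in this sketch.

```python
import re

subject = "one\ntwo\n"

# ^ matches only at the start of the subject by default ...
first_only = re.findall(r"^\w+", subject)
# ... and also after internal newlines in multiline mode.
every_line_start = re.findall(r"^\w+", subject, re.MULTILINE)

# $ matches before the newline at the end of the subject ...
last_only = re.findall(r"\w+$", subject)
# ... and also before internal newlines in multiline mode.
every_line_end = re.findall(r"\w+$", subject, re.MULTILINE)

# \A is the start of the subject regardless of multiline mode.
a_anchor = re.search(r"\Atwo", subject, re.MULTILINE)

# \b matches at word boundaries only.
whole_words = re.findall(r"\bcat\b", "cat concat cat.")

print(first_only, every_line_start, last_only, every_line_end, a_anchor, whole_words)
```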
||||
.
|
||||
.
|
||||
.SH "MATCH POINT RESET"
|
||||
.rs
|
||||
.sp
|
||||
\eK reset start of match
|
||||
.sp
|
||||
\eK is honoured in positive assertions, but ignored in negative ones.
|
||||
.
|
||||
.
|
||||
.SH "ALTERNATION"
|
||||
.rs
|
||||
.sp
|
||||
expr|expr|expr...
|
||||
.
|
||||
.
|
||||
.SH "CAPTURING"
|
||||
.rs
|
||||
.sp
|
||||
(...) capturing group
|
||||
(?<name>...) named capturing group (Perl)
|
||||
(?'name'...) named capturing group (Perl)
|
||||
(?P<name>...) named capturing group (Python)
|
||||
(?:...) non-capturing group
|
||||
(?|...) non-capturing group; reset group numbers for
|
||||
capturing groups in each alternative
|
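Numbered, named, and non-capturing groups can be compared side by side. The sketch uses Python's re module, which supports only the (?P<name>...) spelling of named groups from the table above; the Perl spellings and (?|...) are not available there. The subject strings are made up for illustration.

```python
import re

# Named groups (Python syntax) are also numbered from left to right.
m = re.search(r"(?P<year>\d{4})-(?P<month>\d{2})", "released 2014-10-20")
year_by_name = m.group("year")
year_by_number = m.group(1)
month = m.group("month")

# A non-capturing group takes part in matching but is not numbered,
# so the only captured group here is (c).
groups = re.search(r"(?:ab)+(c)", "ababc").groups()

print(year_by_name, year_by_number, month, groups)
```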
||||
.
|
||||
.
|
||||
.SH "ATOMIC GROUPS"
|
||||
.rs
|
||||
.sp
|
||||
(?>...) atomic, non-capturing group
|
||||
.
|
||||
.
|
||||
.SH "COMMENT"
|
||||
.rs
|
||||
.sp
|
||||
(?#....) comment (not nestable)
|
||||
.
|
||||
.
|
||||
.SH "OPTION SETTING"
|
||||
.rs
|
||||
.sp
|
||||
(?i) caseless
|
||||
(?J) allow duplicate names
|
||||
(?m) multiline
|
||||
(?s) single line (dotall)
|
||||
(?U) default ungreedy (lazy)
|
||||
(?x) extended (ignore white space)
|
||||
(?-...) unset option(s)
|
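The (?i), (?m), (?s), and (?x) settings have the same spelling in Python's re module, which makes a quick comparison possible. This is a sketch under that assumption; (?J) and (?U) have no Python equivalent, and Python requires these global settings to appear at the start of the pattern.

```python
import re

# (?i) caseless
caseless = re.search(r"(?i)pcre", "PCRE2").group()

# (?s) dotall: without it, dot does not match a newline.
without_s = re.search(r"a.b", "a\nb")
with_s = re.search(r"(?s)a.b", "a\nb").group()

# (?m) multiline: ^ also matches after internal newlines.
line_starts = re.findall(r"(?m)^\d+", "1\n22")

# (?x) extended: white space in the pattern is ignored.
extended = re.search(r"(?x) \d+ \s* kg", "12 kg").group()

print(caseless, without_s, repr(with_s), line_starts, extended)
```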
||||
.sp
|
||||
The following are recognized only at the very start of a pattern or after one
|
||||
of the newline or \eR options with similar syntax. More than one of them may
|
||||
appear.
|
||||
.sp
|
||||
(*LIMIT_MATCH=d) set the match limit to d (decimal number)
|
||||
(*LIMIT_RECURSION=d) set the recursion limit to d (decimal number)
|
||||
(*NOTEMPTY) set PCRE2_NOTEMPTY when matching
|
||||
(*NOTEMPTY_ATSTART) set PCRE2_NOTEMPTY_ATSTART when matching
|
||||
(*NO_AUTO_POSSESS) no auto-possessification (PCRE2_NO_AUTO_POSSESS)
|
||||
(*NO_START_OPT) no start-match optimization (PCRE2_NO_START_OPTIMIZE)
|
||||
(*UTF) set appropriate UTF mode for the library in use
|
||||
(*UCP) set PCRE2_UCP (use Unicode properties for \ed etc)
|
||||
.sp
|
||||
Note that LIMIT_MATCH and LIMIT_RECURSION can only reduce the value of the
|
||||
limits set by the caller of pcre2_match(), not increase them.
|
||||
.
|
||||
.
|
||||
.SH "NEWLINE CONVENTION"
|
||||
.rs
|
||||
.sp
|
||||
These are recognized only at the very start of the pattern or after option
|
||||
settings with a similar syntax.
|
||||
.sp
|
||||
(*CR) carriage return only
|
||||
(*LF) linefeed only
|
||||
(*CRLF) carriage return followed by linefeed
|
||||
(*ANYCRLF) all three of the above
|
||||
(*ANY) any Unicode newline sequence
|
||||
.
|
||||
.
|
||||
.SH "WHAT \eR MATCHES"
|
||||
.rs
|
||||
.sp
|
||||
These are recognized only at the very start of the pattern or after option
|
||||
settings with a similar syntax.
|
||||
.sp
|
||||
(*BSR_ANYCRLF) CR, LF, or CRLF
|
||||
(*BSR_UNICODE) any Unicode newline sequence
|
||||
.
|
||||
.
|
||||
.SH "LOOKAHEAD AND LOOKBEHIND ASSERTIONS"
|
||||
.rs
|
||||
.sp
|
||||
(?=...) positive look ahead
|
||||
(?!...) negative look ahead
|
||||
(?<=...) positive look behind
|
||||
(?<!...) negative look behind
|
||||
.sp
|
||||
Each top-level branch of a look behind must be of a fixed length.
|
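The four assertion forms test the text around the current position without consuming it. The sketch below uses Python's re module, assumed to agree with PCRE2 on these forms. Python's fixed-length rule is stricter than the per-branch rule stated above, but the final check still shows what happens when a look behind has no fixed length.

```python
import re

amounts = "10 USD, 20 EUR"
prices = "$5 and 7"

# Positive look ahead: digits followed by " USD".
usd = re.findall(r"\d+(?= USD)", amounts)
# Negative look ahead: whole numbers not followed by " USD".
not_usd = re.findall(r"\b\d+\b(?! USD)", amounts)

# Positive look behind: digits preceded by a dollar sign.
dollars = re.findall(r"(?<=\$)\d+", prices)
# Negative look behind: whole numbers not preceded by a dollar sign.
plain = re.findall(r"(?<!\$)\b\d+", prices)

# A look behind of variable length is rejected when the pattern is compiled.
try:
    re.compile(r"(?<=a+)b")
    variable_lookbehind_ok = True
except re.error:
    variable_lookbehind_ok = False

print(usd, not_usd, dollars, plain, variable_lookbehind_ok)
```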
||||
.
|
||||
.
|
||||
.SH "BACKREFERENCES"
|
||||
.rs
|
||||
.sp
|
||||
\en reference by number (can be ambiguous)
|
||||
\egn reference by number
|
||||
\eg{n} reference by number
|
||||
\eg{-n} relative reference by number
|
||||
\ek<name> reference by name (Perl)
|
||||
\ek'name' reference by name (Perl)
|
||||
\eg{name} reference by name (Perl)
|
||||
\ek{name} reference by name (.NET)
|
||||
(?P=name) reference by name (Python)
|
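A backreference matches the same text that the referenced group captured, not the group's pattern. The sketch uses Python's re module, which supports the \n and (?P=name) forms from the table above; the \g{...} and \k forms are not accepted in Python patterns. The subject strings are made up for illustration.

```python
import re

# Reference by number: a character followed by the same character.
doubled = re.search(r"(\w)\1", "hello").group()

# Reference by name (Python syntax): the closing quote must equal the opening one.
quoted = re.search(r"(?P<quote>['\"]).*?(?P=quote)", "say 'hi' now").group()

# Repeated words: a word, a space, and the same word again.
repeated = re.findall(r"\b(\w+) \1\b", "the the cat sat sat")

print(doubled, quoted, repeated)
```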
||||
.
|
||||
.
|
||||
.SH "SUBROUTINE REFERENCES (POSSIBLY RECURSIVE)"
|
||||
.rs
|
||||
.sp
|
||||
(?R) recurse whole pattern
|
||||
(?n) call subpattern by absolute number
|
||||
(?+n) call subpattern by relative number
|
||||
(?-n) call subpattern by relative number
|
||||
(?&name) call subpattern by name (Perl)
|
||||
(?P>name) call subpattern by name (Python)
|
||||
\eg<name> call subpattern by name (Oniguruma)
|
||||
\eg'name' call subpattern by name (Oniguruma)
|
||||
\eg<n> call subpattern by absolute number (Oniguruma)
|
||||
\eg'n' call subpattern by absolute number (Oniguruma)
|
||||
\eg<+n> call subpattern by relative number (PCRE2 extension)
|
||||
\eg'+n' call subpattern by relative number (PCRE2 extension)
|
||||
\eg<-n> call subpattern by relative number (PCRE2 extension)
|
||||
\eg'-n' call subpattern by relative number (PCRE2 extension)
|
||||
.
|
||||
.
|
||||
.SH "CONDITIONAL PATTERNS"
|
||||
.rs
|
||||
.sp
|
||||
(?(condition)yes-pattern)
|
||||
(?(condition)yes-pattern|no-pattern)
|
||||
.sp
|
||||
(?(n)... absolute reference condition
|
||||
(?(+n)... relative reference condition
|
||||
(?(-n)... relative reference condition
|
||||
(?(<name>)... named reference condition (Perl)
|
||||
(?('name')... named reference condition (Perl)
|
||||
(?(name)... named reference condition (PCRE2)
|
||||
(?(R)... overall recursion condition
|
||||
(?(Rn)... specific group recursion condition
|
||||
(?(R&name)... specific recursion condition
|
||||
(?(DEFINE)... define subpattern for reference
|
||||
(?(assert)... assertion condition
|
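An absolute reference condition chooses the yes-pattern only if the numbered group has taken part in the match. The sketch uses Python's re module, which supports the (?(n)...) and (?(name)...) forms; the other conditions listed above are not available there. The pattern requires a closing ">" exactly when an opening "<" was matched.

```python
import re

# Group 1 is the optional "<"; (?(1)>) demands ">" only if group 1 matched.
pattern = re.compile(r"(<)?\w+(?(1)>)")

results = {s: pattern.fullmatch(s) is not None for s in ["<a>", "a", "<a", "a>"]}

for subject, ok in results.items():
    print(subject, ok)
```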
||||
.
|
||||
.
|
||||
.SH "BACKTRACKING CONTROL"
|
||||
.rs
|
||||
.sp
|
||||
The following act immediately when they are reached:
|
||||
.sp
|
||||
(*ACCEPT) force successful match
|
||||
(*FAIL) force backtrack; synonym (*F)
|
||||
(*MARK:NAME) set name to be passed back; synonym (*:NAME)
|
||||
.sp
|
||||
The following act only when a subsequent match failure causes a backtrack to
|
||||
reach them. They all force a match failure, but they differ in what happens
|
||||
afterwards. Those that advance the start-of-match point do so only if the
|
||||
pattern is not anchored.
|
||||
.sp
|
||||
(*COMMIT) overall failure, no advance of starting point
|
||||
(*PRUNE) advance to next starting character
|
||||
(*PRUNE:NAME) equivalent to (*MARK:NAME)(*PRUNE)
|
||||
(*SKIP) advance to current matching position
|
||||
(*SKIP:NAME) advance to position corresponding to an earlier
|
||||
(*MARK:NAME); if not found, the (*SKIP) is ignored
|
||||
(*THEN) local failure, backtrack to next alternation
|
||||
(*THEN:NAME) equivalent to (*MARK:NAME)(*THEN)
|
||||
.
|
||||
.
|
||||
.SH "CALLOUTS"
|
||||
.rs
|
||||
.sp
|
||||
(?C) callout
|
||||
(?Cn) callout with data n
|
||||
.
|
||||
.
|
||||
.SH "SEE ALSO"
|
||||
.rs
|
||||
.sp
|
||||
\fBpcre2pattern\fP(3), \fBpcre2api\fP(3), \fBpcre2callout\fP(3),
|
||||
\fBpcre2matching\fP(3), \fBpcre2\fP(3).
|
||||
.
|
||||
.
|
||||
.SH AUTHOR
|
||||
.rs
|
||||
.sp
|
||||
.nf
|
||||
Philip Hazel
|
||||
University Computing Service
|
||||
Cambridge CB2 3QH, England.
|
||||
.fi
|
||||
.
|
||||
.
|
||||
.SH REVISION
|
||||
.rs
|
||||
.sp
|
||||
.nf
|
||||
Last updated: 20 October 2014
|
||||
Copyright (c) 1997-2014 University of Cambridge.
|
||||
.fi
|