Fix error in documentation.

This commit is contained in:
ph10 2016-10-28 16:08:44 +00:00
parent f9bba1c615
commit cc3a8ad246

146
HACKING
View File

@ -7,7 +7,7 @@ but with a revised (and incompatible) API. To avoid confusion, the original
library is referred to as PCRE1 below. For information about testing PCRE2, see
the pcre2test documentation and the comment at the head of the RunTest file.
PCRE1 releases were up to 8.3x when PCRE2 was developed, and later bug fix
PCRE1 releases were up to 8.3x when PCRE2 was developed, and later bug fix
releases remain in the 8.xx series. PCRE2 releases started at 10.00 to avoid
confusion with PCRE1.
@ -124,39 +124,39 @@ compile) has full knowledge of group names and numbers throughout. Several
dozen lines of messy code were eliminated, though the new pre-pass was not
short. In particular, parsing and skipping over [] classes is complicated.
While working on 10.22 I realized that I could simplify yet again by moving
While working on 10.22 I realized that I could simplify yet again by moving
more of the parsing into the pre-pass, thus avoiding doing it in two places, so
after 10.22 was released, the code underwent yet another big refactoring. This
is how it is from 10.23 onwards:
The function called parse_regex() scans the pattern characters, parsing them
into literal data and meta characters. It converts escapes such as \x{123}
into literals, handles \Q...\E, and skips over comments and non-significant
white space. The result of the scanning is put into a vector of 32-bit unsigned
integers. Values less than 0x80000000 are literal data. Higher values represent
The function called parse_regex() scans the pattern characters, parsing them
into literal data and meta characters. It converts escapes such as \x{123}
into literals, handles \Q...\E, and skips over comments and non-significant
white space. The result of the scanning is put into a vector of 32-bit unsigned
integers. Values less than 0x80000000 are literal data. Higher values represent
meta-characters. The top 16-bits of such values identify the meta-character,
and these are given names such as META_CAPTURE. The lower 16-bits are available
for data, for example, the capturing group number. The only situation in which
literal data values greater than 0x7fffffff can appear is when the 32-bit
library is running in non-UTF mode. This is handled by having a special
for data, for example, the capturing group number. The only situation in which
literal data values greater than 0x7fffffff can appear is when the 32-bit
library is running in non-UTF mode. This is handled by having a special
meta-character that is followed by the 32-bit data value.
The size of the parsed pattern vector, when auto-callouts are not enabled, is
bounded by the length of the pattern (with one exception). The code is written
so that each item in the pattern uses no more vector elements than the number
bounded by the length of the pattern (with one exception). The code is written
so that each item in the pattern uses no more vector elements than the number
of code units in the item itself. The exception is the aforementioned large
32-bit number handling. For this reason, 32-bit non-UTF patterns are scanned in
advance to check for such values. When auto-callouts are enabled, the generous
assumption is made that there will be a callout for each pattern code unit
(which of course is only actually true if all code units are literals) plus one
(which of course is only actually true if all code units are literals) plus one
at the end. There is a default parsed pattern vector on the stack, but if this
is not big enough, heap memory is used.
As before, the actual compiling function is run twice, the first time to
determine the amount of memory needed for the final compiled pattern. It
As before, the actual compiling function is run twice, the first time to
determine the amount of memory needed for the final compiled pattern. It
now processes the parsed pattern vector, not the pattern itself, although some
of the parsed items refer to strings in the pattern - for example, group
names. As escapes and comments have already been processed, the code is a bit
names. As escapes and comments have already been processed, the code is a bit
simpler than before.
Most errors can be diagnosed during the parsing scan. For those that cannot
@ -168,64 +168,67 @@ identify where errors occur.
The elements of the parsed pattern vector
-----------------------------------------
The word "offset" below means a code unit offset into the pattern. When
The word "offset" below means a code unit offset into the pattern. When
PCRE2_SIZE (which is usually size_t) is no bigger than uint32_t, an offset is
stored in a single parsed pattern element. Otherwise (typically on 64-bit
systems) it occupies two elements. The following meta items occupy just one
element, with no data:
META_ACCEPT (*ACCEPT)
META_ALT | alternation
META_ASTERISK *
META_ASTERISK_PLUS *+
META_ASTERISK_QUERY *?
META_ATOMIC (?> start of atomic group
META_CIRCUMFLEX ^ metacharacter
META_CLASS [ start of non-empty class
META_CLASS_EMPTY [] empty class - only with PCRE2_ALLOW_EMPTY_CLASS
META_ASTERISK *
META_ASTERISK_PLUS *+
META_ASTERISK_QUERY *?
META_ATOMIC (?> start of atomic group
META_CIRCUMFLEX ^ metacharacter
META_CLASS [ start of non-empty class
META_CLASS_EMPTY [] empty class - only with PCRE2_ALLOW_EMPTY_CLASS
META_CLASS_EMPTY_NOT [^] negative empty class - ditto
META_CLASS_END ] end of non-empty class
META_CLASS_NOT [^ start non-empty negative class
META_CLASS_END ] end of non-empty class
META_CLASS_NOT [^ start non-empty negative class
META_COMMIT (*COMMIT)
META_DOLLAR $ metacharacter
META_DOT . metacharacter
META_DOLLAR $ metacharacter
META_DOT . metacharacter
META_END End of pattern (this value is 0x80000000)
META_FAIL (*FAIL)
META_KET ) closing parenthesis
META_KET ) closing parenthesis
META_LOOKAHEAD (?= start of lookahead
META_LOOKAHEADNOT (?! start of negative lookahead
META_NOCAPTURE (?: no capture parens
META_PLUS +
META_PLUS_PLUS ++
META_PLUS_QUERY +?
META_NOCAPTURE (?: no capture parens
META_PLUS +
META_PLUS_PLUS ++
META_PLUS_QUERY +?
META_PRUNE (*PRUNE) - no argument
META_QUERY ?
META_QUERY_PLUS ?+
META_QUERY_QUERY ??
META_RANGE_ESCAPED hyphen in class range with at least one escape
META_RANGE_LITERAL hyphen in class range defined literally
META_QUERY ?
META_QUERY_PLUS ?+
META_QUERY_QUERY ??
META_RANGE_ESCAPED hyphen in class range with at least one escape
META_RANGE_LITERAL hyphen in class range defined literally
META_SKIP (*SKIP) - no argument
META_THEN (*THEN) - no argument
The two RANGE values occur only in character classes. They are positioned
between two literals that define the start and end of the range. In an EBCDIC
evironment it is necessary to know whether either of the range values was
specified as an escape. In an ASCII/Unicode environment the distinction is not
The two RANGE values occur only in character classes. They are positioned
between two literals that define the start and end of the range. In an EBCDIC
evironment it is necessary to know whether either of the range values was
specified as an escape. In an ASCII/Unicode environment the distinction is not
relevant.
The following have data in the lower 16 bits, and may be followed by other data
The following have data in the lower 16 bits, and may be followed by other data
elements:
META_ALT | alternation
META_BACKREF
META_CAPTURE
META_ESCAPE
META_RECURSE
If the data for META_ALT is non-zero, it is inside a lookbehind, and the data
is the length of its branch, for which OP_REVERSE must be generated.
META_BACKREF, META_CAPTURE, and META_RECURSE have the capture group number as
their data in the lower 16 bits of the element.
META_BACKREF is followed by an offset if the back reference group number is 10
or more. The offsets of the first ocurrences of references to groups whose
or more. The offsets of the first ocurrences of references to groups whose
numbers are less than 10 are put in cb->small_ref_offset[] (only the first
occurrence is useful). On 64-bit systems this avoids using more than two parsed
pattern elements for items such as \3. The offset is used when an error is
@ -241,7 +244,7 @@ and an offset into the pattern to specify the name.
The following have one data item that follows in the next vector element:
META_BIGVALUE Next is a literal >= META_END
META_BIGVALUE Next is a literal >= META_END
META_OPTIONS (?i) and friends (data is new option bits)
META_POSIX POSIX class item (data identifies the class)
META_POSIX_NEG negative POSIX class item (ditto)
@ -249,19 +252,19 @@ META_POSIX_NEG negative POSIX class item (ditto)
The following are followed by a length element, then a number of character code
values (which should match with the length):
META_MARK (*MARK:xxxx)
META_MARK (*MARK:xxxx)
META_PRUNE_ARG (*PRUNE:xxx)
META_SKIP_ARG (*SKIP:xxxx)
META_THEN_ARG (*THEN:xxxx)
The following are followed by a length element, then an offset in the pattern
The following are followed by a length element, then an offset in the pattern
that identifies the name:
META_COND_NAME (?(<name>) or (?('name') or (?(name)
META_COND_RNAME (?(R&name)
META_COND_NAME (?(<name>) or (?('name') or (?(name)
META_COND_RNAME (?(R&name)
META_COND_RNUMBER (?(Rdigits)
META_RECURSE_BYNAME (?&name)
META_BACKREF_BYNAME \k'name'
META_RECURSE_BYNAME (?&name)
META_BACKREF_BYNAME \k'name'
META_COND_RNUMBER is used for names that start with R and continue with digits,
because this is an ambiguous case. It could be a back reference to a group with
@ -269,26 +272,31 @@ that name, or it could be a recursion test on a numbered group.
This one is followed by an offset, for use in error messages, then a number:
META_COND_NUMBER (?([+-]digits)
META_COND_NUMBER (?([+-]digits)
The following are followed just by an offset, for use in error messages:
META_COND_ASSERT (?(?assertion)
META_COND_DEFINE (?(DEFINE)
META_LOOKBEHIND (?<=
META_LOOKBEHINDNOT (?<!
In fact, META_COND_ASSERT is used for any group starting (?( that does not
match any of the other META_COND cases. The check that this group is an
assertion (optionally preceded by a callout) happens at compile time.
In fact, META_COND_ASSERT is used for any group starting (?( that does not
match any of the other META_COND cases. The check that this group is an
assertion (optionally preceded by a callout) happens at compile time.
The following are also followed just by an offset, but also the lower 16 bits
of the main word contain the length of the first branch of the lookbehind
group; this is used when generating OP_REVERSE for that branch.
META_LOOKBEHIND (?<=
META_LOOKBEHINDNOT (?<!
The following are followed by two values, the minimum and maximum. Repeat
values are limited to 65535 (MAX_REPEAT). A maximum value of "unlimited" is
represented by UNLIMITED_REPEAT, which is bigger than MAX_REPEAT:
META_MINMAX {n,m} repeat
META_MINMAX_PLUS {n,m}+ repeat
META_MINMAX_QUERY {n,m}? repeat
META_MINMAX {n,m} repeat
META_MINMAX_PLUS {n,m}+ repeat
META_MINMAX_QUERY {n,m}? repeat
This one is followed by three elements. The first is 0 for '>' and 1 for '>=';
the next two are the major and minor numbers:
@ -297,11 +305,11 @@ META_COND_VERSION (?(VERSION<op>x.y)
Callouts are converted into one of two items:
META_CALLOUT_NUMBER (?C with numerical argument
META_CALLOUT_STRING (?C with string argument
META_CALLOUT_NUMBER (?C with numerical argument
META_CALLOUT_STRING (?C with string argument
In both cases, the next two elements contain the offset and length of the next
item in the pattern. Then there is either one callout number, or a length and
In both cases, the next two elements contain the offset and length of the next
item in the pattern. Then there is either one callout number, or a length and
an offset for the string argument. The length includes both delimiters.
@ -410,11 +418,11 @@ These items are all just one unit long
OP_THEN )
OP_ASSERT_ACCEPT is used when (*ACCEPT) is encountered within an assertion.
This ends the assertion, not the entire pattern match. The assertion (?!) is
This ends the assertion, not the entire pattern match. The assertion (?!) is
always optimized to OP_FAIL.
OP_ALLANY is used for '.' when PCRE2_DOTALL is set. It is also used for \C in
non-UTF modes and in UTF-32 mode (since one code unit still equals one
non-UTF modes and in UTF-32 mode (since one code unit still equals one
character). Another use is for [^] when empty classes are permitted
(PCRE2_ALLOW_EMPTY_CLASS is set).
@ -735,8 +743,8 @@ immediately before the assertion. It is also possible to insert a manual
callout at this point. Only assertion conditions may have callouts preceding
the condition.
A condition that is the negative assertion (?!) is optimized to OP_FAIL in all
parts of the pattern, so this is another opcode that may appear as a condition.
A condition that is the negative assertion (?!) is optimized to OP_FAIL in all
parts of the pattern, so this is another opcode that may appear as a condition.
It is treated the same as OP_FALSE.