Documentation update.

This commit is contained in:
ph10 2017-04-21 16:30:18 +00:00
parent 481b69c21b
commit 88c15c7aa5

110
HACKING
View File

@ -48,18 +48,20 @@ Friedl's terminology.
OK, here's the real stuff
-------------------------
For the set of functions that formed the original PCRE1 library (which are
unrelated to those mentioned above), I tried at first to invent an algorithm
that used an amount of store bounded by a multiple of the number of characters
in the pattern, to save on compiling time. However, because of the greater
complexity in Perl regular expressions, I couldn't do this. In any case, a
first pass through the pattern is helpful for other reasons.
For the set of functions that formed the original PCRE1 library in 1997 (which
are unrelated to those mentioned above), I tried at first to invent an
algorithm that used an amount of store bounded by a multiple of the number of
characters in the pattern, to save on compiling time. However, because of the
greater complexity in Perl regular expressions, I couldn't do this, even though
the then current Perl 5.004 patterns were much simpler than those supported
nowadays. In any case, a first pass through the pattern is helpful for other
reasons.
Support for 16-bit and 32-bit data strings
-------------------------------------------
The library can be compiled in any combination of 8-bit, 16-bit or 32-bit
The PCRE2 library can be compiled in any combination of 8-bit, 16-bit or 32-bit
modes, creating up to three different libraries. In the description that
follows, the word "short" is used for a 16-bit data quantity, and the phrase
"code unit" is used for a quantity that is a byte in 8-bit mode, a short in
@ -122,7 +124,7 @@ all the named subpatterns and their corresponding group numbers. This means
that the actual compile (both the memory-computing dummy run and the real
compile) has full knowledge of group names and numbers throughout. Several
dozen lines of messy code were eliminated, though the new pre-pass was not
short. In particular, parsing and skipping over [] classes was complicated.
short. In particular, parsing and skipping over [] classes is complicated.
While working on 10.22 I realized that I could simplify yet again by moving
more of the parsing into the pre-pass, thus avoiding doing it in two places, so
@ -149,8 +151,8 @@ of code units in the item itself. The exception is the aforementioned large
advance to check for such values. When auto-callouts are enabled, the generous
assumption is made that there will be a callout for each pattern code unit
(which of course is only actually true if all code units are literals) plus one
at the end. There is a default parsed pattern vector on the stack, but if this
is not big enough, heap memory is used.
at the end. There is a default parsed pattern vector on the system stack, but
if this is not big enough, heap memory is used.
As before, the actual compiling function is run twice, the first time to
determine the amount of memory needed for the final compiled pattern. It
@ -343,9 +345,14 @@ Changeable options
------------------
The /i, /m, or /s options (PCRE2_CASELESS, PCRE2_MULTILINE, PCRE2_DOTALL, and
some others) may change in the middle of patterns. Their processing is handled
entirely at compile time by generating different opcodes for the different
settings. The runtime functions do not need to keep track of an options state.
others) may be changed in the middle of patterns by items such as (?i). Their
processing is handled entirely at compile time by generating different opcodes
for the different settings. The runtime functions do not need to keep track of
an options state.
PCRE2_DUPNAMES, PCRE2_EXTENDED, PCRE2_EXTENDED_MORE, and PCRE2_NO_AUTO_CAPTURE
are tracked and processed during the parsing pre-pass. The others are handled
from META_OPTIONS items during the main compile phase.
Format of compiled patterns
@ -437,14 +444,22 @@ Matching literal characters
---------------------------
The OP_CHAR opcode is followed by a single character that is to be matched
casefully. For caseless matching, OP_CHARI is used. In UTF-8 or UTF-16 modes,
the character may be more than one code unit long. In UTF-32 mode, characters
are always exactly one code unit long.
casefully. For caseless matching of characters that have at most two
case-equivalent code points, OP_CHARI is used. In UTF-8 or UTF-16 modes, the
character may be more than one code unit long. In UTF-32 mode, characters are
always exactly one code unit long.
If there is only one character in a character class, OP_CHAR or OP_CHARI is
used for a positive class, and OP_NOT or OP_NOTI for a negative one (that is,
for something like [^a]).
Caseless matching (positive or negative) of characters that have more than two
case-equivalent code points (which is possible only in UTF mode) is handled by
compiling a Unicode property item (see below), with the pseudo-property
PT_CLIST. The value of this property is an offset in a vector called
"ucd_caseless_sets" which identifies the start of a short list of equivalent
characters, terminated by the value NOTACHAR (0xffffffff).
Repeating single characters
---------------------------
@ -520,7 +535,8 @@ Each is followed by two code units that encode the desired property as a type
and a value. The types are a set of #defines of the form PT_xxx, and the values
are enumerations of the form ucp_xx, defined in the pcre2_ucp.h source file.
The value is relevant only for PT_GC (General Category), PT_PC (Particular
Category), and PT_SC (Script).
Category), PT_SC (Script), and the pseudo-property PT_CLIST, which is used to
identify a list of case-equivalent characters when there are three or more.
Repeats of these items use the OP_TYPESTAR etc. set of opcodes, followed by
three code units: OP_PROP or OP_NOTPROP, and then the desired property type and
@ -532,7 +548,10 @@ Character classes
If there is only one character in a class, OP_CHAR or OP_CHARI is used for a
positive class, and OP_NOT or OP_NOTI for a negative one (that is, for
something like [^a]).
something like [^a]), except when caselessly matching a character that has more
than two case-equivalent code points (which can happen only in UTF mode). In
this case a Unicode property item is used, as described above in "Matching
literal characters".
A set of repeating opcodes (called OP_NOTSTAR etc.) are used for repeated,
negated, single-character classes. The normal single-character opcodes
@ -553,8 +572,8 @@ do.
For classes containing characters with values greater than 255 or that contain
\p or \P, OP_XCLASS is used. It optionally uses a bit map if any acceptable
code points are less than 256, followed by a list of pairs (for a range) and/or
single characters and/or properties. In caseless mode, both cases are
explicitly listed.
single characters and/or properties. In caseless mode, all equivalent
characters are explicitly listed.
OP_XCLASS is followed by a LINK_SIZE value containing the total length of the
opcode and its data. This is followed by a code unit containing flag bits:
@ -611,8 +630,8 @@ opcode to see if it is one of these:
OP_CRMINRANGE
OP_CRPOSRANGE
All but the last three are single-code-unit items, with no data. The others are
followed by the minimum and maximum repeat counts.
All but the last three are single-code-unit items, with no data. The range
opcodes are followed by the minimum and maximum repeat counts.
Brackets and alternation
@ -627,16 +646,17 @@ myself, can be round, square, curly, or pointy. Hence this usage rather than
Non-capturing brackets use the opcode OP_BRA, capturing brackets use OP_CBRA. A
bracket opcode is followed by a LINK_SIZE value which gives the offset to the
next alternative OP_ALT or, if there aren't any branches, to the matching
OP_KET opcode. Each OP_ALT is followed by a LINK_SIZE value giving the offset
to the next one, or to the OP_KET opcode. For capturing brackets, the bracket
number is a count that immediately follows the offset.
next alternative OP_ALT or, if there aren't any branches, to the terminating
opcode. Each OP_ALT is followed by a LINK_SIZE value giving the offset to the
next one, or to the final opcode. For capturing brackets, the bracket number is
a count that immediately follows the offset.
OP_KET is used for subpatterns that do not repeat indefinitely, and OP_KETRMIN
and OP_KETRMAX are used for indefinite repetitions, minimally or maximally
respectively (see below for possessive repetitions). All three are followed by
a LINK_SIZE value giving (as a positive number) the offset back to the matching
bracket opcode.
There are several opcodes that mark the end of a subpattern group. OP_KET is
used for subpatterns that do not repeat indefinitely, OP_KETRMIN and
OP_KETRMAX are used for indefinite repetitions, minimally or maximally
respectively, and OP_KETRPOS for possessive repetitions (see below for more
details). All four are followed by a LINK_SIZE value giving (as a positive
number) the offset back to the matching bracket opcode.
If a subpattern is quantified such that it is permitted to match zero times, it
is preceded by one of OP_BRAZERO, OP_BRAMINZERO, or OP_SKIPZERO. These are
@ -725,11 +745,11 @@ tests the PCRE2 version number. This compiles into one of the opcodes OP_TRUE
or OP_FALSE.
If a condition is not a back reference, recursion test, DEFINE, or VERSION, it
must start with an assertion, whose opcode normally immediately follows OP_COND
or OP_SCOND. However, if automatic callouts are enabled, a callout is inserted
immediately before the assertion. It is also possible to insert a manual
callout at this point. Only assertion conditions may have callouts preceding
the condition.
must start with a parenthesized assertion, whose opcode normally immediately
follows OP_COND or OP_SCOND. However, if automatic callouts are enabled, a
callout is inserted immediately before the assertion. It is also possible to
insert a manual callout at this point. Only assertion conditions may have
callouts preceding the condition.
A condition that is the negative assertion (?!) is optimized to OP_FAIL in all
parts of the pattern, so this is another opcode that may appear as a condition.
@ -758,12 +778,12 @@ treated as (?1)(?1)(?:(?1)){0,2}.
Callouts
--------
A callout can nowadays have either a numerical argument or a string argument.
These use OP_CALLOUT or OP_CALLOUT_STR, respectively. In each case these are
followed by two LINK_SIZE values giving the offset in the pattern string to the
start of the following item, and another count giving the length of this item.
These values make it possible for pcre2test to output useful tracing
information using callouts.
A callout may have either a numerical argument or a string argument. These use
OP_CALLOUT or OP_CALLOUT_STR, respectively. In each case these are followed by
two LINK_SIZE values giving the offset in the pattern string to the start of
the following item, and another count giving the length of this item. These
values make it possible for pcre2test to output useful tracing information
using callouts.
In the case of a numeric callout, after these two values there is a single code
unit containing the callout number, in the range 0-255, with 255 being used for
@ -790,8 +810,8 @@ Opcode table checking
---------------------
The last opcode that is defined in pcre2_internal.h is OP_TABLE_LENGTH. This is
not a real opcode, but is used to check that tables indexed by opcode are the
correct length, in order to catch updating errors.
not a real opcode, but is used to check at compile time that tables indexed by
opcode are the correct length, in order to catch updating errors.
Philip Hazel
17 March 2017
21 April 2017