Documentation update.

2017-04-21 16:30:18 +00:00 · 88c15c7aa5
commit 88c15c7aa5
parent 481b69c21b
1 changed file with 65 additions and 45 deletions
--- a/110
+++ b/110
@@ -48,18 +48,20 @@ Friedl's terminology.
 OK, here's the real stuff
 -------------------------
-For the set of functions that formed the original PCRE1 library (which are
+For the set of functions that formed the original PCRE1 library in 1997 (which
-unrelated to those mentioned above), I tried at first to invent an algorithm
+are unrelated to those mentioned above), I tried at first to invent an
-that used an amount of store bounded by a multiple of the number of characters
+algorithm that used an amount of store bounded by a multiple of the number of
-in the pattern, to save on compiling time. However, because of the greater
+characters in the pattern, to save on compiling time. However, because of the
-complexity in Perl regular expressions, I couldn't do this. In any case, a
+greater complexity in Perl regular expressions, I couldn't do this, even though
-first pass through the pattern is helpful for other reasons.
+the then current Perl 5.004 patterns were much simpler than those supported
 nowadays. In any case, a first pass through the pattern is helpful for other
 reasons.
 Support for 16-bit and 32-bit data strings
 -------------------------------------------
-The library can be compiled in any combination of 8-bit, 16-bit or 32-bit
+The PCRE2 library can be compiled in any combination of 8-bit, 16-bit or 32-bit
 modes, creating up to three different libraries. In the description that
 follows, the word "short" is used for a 16-bit data quantity, and the phrase
 "code unit" is used for a quantity that is a byte in 8-bit mode, a short in
@@ -122,7 +124,7 @@ all the named subpatterns and their corresponding group numbers. This means
 that the actual compile (both the memory-computing dummy run and the real
 compile) has full knowledge of group names and numbers throughout. Several
 dozen lines of messy code were eliminated, though the new pre-pass was not
-short. In particular, parsing and skipping over [] classes was complicated.
+short. In particular, parsing and skipping over [] classes is complicated.
 While working on 10.22 I realized that I could simplify yet again by moving
 more of the parsing into the pre-pass, thus avoiding doing it in two places, so
@@ -149,8 +151,8 @@ of code units in the item itself. The exception is the aforementioned large
 advance to check for such values. When auto-callouts are enabled, the generous
 assumption is made that there will be a callout for each pattern code unit
 (which of course is only actually true if all code units are literals) plus one
-at the end. There is a default parsed pattern vector on the stack, but if this
+at the end. There is a default parsed pattern vector on the system stack, but
-is not big enough, heap memory is used.
+if this is not big enough, heap memory is used.
 As before, the actual compiling function is run twice, the first time to
 determine the amount of memory needed for the final compiled pattern. It
@@ -343,9 +345,14 @@ Changeable options
 ------------------
 The /i, /m, or /s options (PCRE2_CASELESS, PCRE2_MULTILINE, PCRE2_DOTALL, and
-some others) may change in the middle of patterns. Their processing is handled
+others) may be changed in the middle of patterns by items such as (?i). Their
-entirely at compile time by generating different opcodes for the different
+processing is handled entirely at compile time by generating different opcodes
-settings. The runtime functions do not need to keep track of an options state.
+for the different settings. The runtime functions do not need to keep track of
 an options state.
 PCRE2_DUPNAMES, PCRE2_EXTENDED, PCRE2_EXTENDED_MORE, and PCRE2_NO_AUTO_CAPTURE
 are tracked and processed during the parsing pre-pass. The others are handled
 from META_OPTIONS items during the main compile phase.
 Format of compiled patterns
@@ -437,14 +444,22 @@ Matching literal characters
 ---------------------------
 The OP_CHAR opcode is followed by a single character that is to be matched
-casefully. For caseless matching, OP_CHARI is used. In UTF-8 or UTF-16 modes,
+casefully. For caseless matching of characters that have at most two
-the character may be more than one code unit long. In UTF-32 mode, characters
+case-equivalent code points, OP_CHARI is used. In UTF-8 or UTF-16 modes, the
-are always exactly one code unit long.
+character may be more than one code unit long. In UTF-32 mode, characters are
 always exactly one code unit long.
 If there is only one character in a character class, OP_CHAR or OP_CHARI is
 used for a positive class, and OP_NOT or OP_NOTI for a negative one (that is,
 for something like [^a]).
 Caseless matching (positive or negative) of characters that have more than two
 case-equivalent code points (which is possible only in UTF mode) is handled by
 compiling a Unicode property item (see below), with the pseudo-property
 PT_CLIST. The value of this property is an offset in a vector called
 "ucd_caseless_sets" which identifies the start of a short list of equivalent
 characters, terminated by the value NOTACHAR (0xffffffff).
 Repeating single characters
 ---------------------------
@@ -520,7 +535,8 @@ Each is followed by two code units that encode the desired property as a type
 and a value. The types are a set of #defines of the form PT_xxx, and the values
 are enumerations of the form ucp_xx, defined in the pcre2_ucp.h source file.
 The value is relevant only for PT_GC (General Category), PT_PC (Particular
-Category), and PT_SC (Script).
+Category), PT_SC (Script), and the pseudo-property PT_CLIST, which is used to
 identify a list of case-equivalent characters when there are three or more.
 Repeats of these items use the OP_TYPESTAR etc. set of opcodes, followed by
 three code units: OP_PROP or OP_NOTPROP, and then the desired property type and
@@ -532,7 +548,10 @@ Character classes
 If there is only one character in a class, OP_CHAR or OP_CHARI is used for a
 positive class, and OP_NOT or OP_NOTI for a negative one (that is, for
-something like [^a]).
+something like [^a]), except when caselessly matching a character that has more
 than two case-equivalent code points (which can happen only in UTF mode). In
 this case a Unicode property item is used, as described above in "Matching
 literal characters".
 A set of repeating opcodes (called OP_NOTSTAR etc.) are used for repeated,
 negated, single-character classes. The normal single-character opcodes
@@ -553,8 +572,8 @@ do.
 For classes containing characters with values greater than 255 or that contain
 \p or \P, OP_XCLASS is used. It optionally uses a bit map if any acceptable
 code points are less than 256, followed by a list of pairs (for a range) and/or
-single characters and/or properties. In caseless mode, both cases are
+single characters and/or properties. In caseless mode, all equivalent
-explicitly listed.
+characters are explicitly listed.
 OP_XCLASS is followed by a LINK_SIZE value containing the total length of the
 opcode and its data. This is followed by a code unit containing flag bits:
@@ -611,8 +630,8 @@ opcode to see if it is one of these:
  OP_CRMINRANGE
  OP_CRPOSRANGE
-All but the last three are single-code-unit items, with no data. The others are
+All but the last three are single-code-unit items, with no data. The range
-followed by the minimum and maximum repeat counts.
+opcodes are followed by the minimum and maximum repeat counts.
 Brackets and alternation
@@ -627,16 +646,17 @@ myself, can be round, square, curly, or pointy. Hence this usage rather than
 Non-capturing brackets use the opcode OP_BRA, capturing brackets use OP_CBRA. A
 bracket opcode is followed by a LINK_SIZE value which gives the offset to the
-next alternative OP_ALT or, if there aren't any branches, to the matching
+next alternative OP_ALT or, if there aren't any branches, to the terminating
-OP_KET opcode. Each OP_ALT is followed by a LINK_SIZE value giving the offset
+opcode. Each OP_ALT is followed by a LINK_SIZE value giving the offset to the
-to the next one, or to the OP_KET opcode. For capturing brackets, the bracket
+next one, or to the final opcode. For capturing brackets, the bracket number is
-number is a count that immediately follows the offset.
+a count that immediately follows the offset.
-OP_KET is used for subpatterns that do not repeat indefinitely, and OP_KETRMIN
+There are several opcodes that mark the end of a subpattern group. OP_KET is
-and OP_KETRMAX are used for indefinite repetitions, minimally or maximally
+used for subpatterns that do not repeat indefinitely, OP_KETRMIN and
-respectively (see below for possessive repetitions). All three are followed by
+OP_KETRMAX are used for indefinite repetitions, minimally or maximally
-a LINK_SIZE value giving (as a positive number) the offset back to the matching
+respectively, and OP_KETRPOS for possessive repetitions (see below for more 
-bracket opcode.
+details). All four are followed by a LINK_SIZE value giving (as a positive
 number) the offset back to the matching bracket opcode.
 If a subpattern is quantified such that it is permitted to match zero times, it
 is preceded by one of OP_BRAZERO, OP_BRAMINZERO, or OP_SKIPZERO. These are
@@ -725,11 +745,11 @@ tests the PCRE2 version number. This compiles into one of the opcodes OP_TRUE
 or OP_FALSE.
 If a condition is not a back reference, recursion test, DEFINE, or VERSION, it
-must start with an assertion, whose opcode normally immediately follows OP_COND
+must start with a parenthesized assertion, whose opcode normally immediately
-or OP_SCOND. However, if automatic callouts are enabled, a callout is inserted
+follows OP_COND or OP_SCOND. However, if automatic callouts are enabled, a
-immediately before the assertion. It is also possible to insert a manual
+callout is inserted immediately before the assertion. It is also possible to
-callout at this point. Only assertion conditions may have callouts preceding
+insert a manual callout at this point. Only assertion conditions may have
-the condition.
+callouts preceding the condition.
 A condition that is the negative assertion (?!) is optimized to OP_FAIL in all
 parts of the pattern, so this is another opcode that may appear as a condition.
@@ -758,12 +778,12 @@ treated as (?1)(?1)(?:(?1)){0,2}.
 Callouts
 --------
-A callout can nowadays have either a numerical argument or a string argument.
+A callout may have either a numerical argument or a string argument. These use
-These use OP_CALLOUT or OP_CALLOUT_STR, respectively. In each case these are
+OP_CALLOUT or OP_CALLOUT_STR, respectively. In each case these are followed by
-followed by two LINK_SIZE values giving the offset in the pattern string to the
+two LINK_SIZE values giving the offset in the pattern string to the start of
-start of the following item, and another count giving the length of this item.
+the following item, and another count giving the length of this item. These
-These values make it possible for pcre2test to output useful tracing
+values make it possible for pcre2test to output useful tracing information
-information using callouts.
+using callouts.
 In the case of a numeric callout, after these two values there is a single code
 unit containing the callout number, in the range 0-255, with 255 being used for
@@ -790,8 +810,8 @@ Opcode table checking
 ---------------------
 The last opcode that is defined in pcre2_internal.h is OP_TABLE_LENGTH. This is
-not a real opcode, but is used to check that tables indexed by opcode are the
+not a real opcode, but is used to check at compile time that tables indexed by
-correct length, in order to catch updating errors.
+opcode are the correct length, in order to catch updating errors.
 Philip Hazel
-17 March 2017
+21 April 2017
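
The PT_CLIST text added in this change says that the property value is an
offset into "ucd_caseless_sets" that identifies the start of a short list of
equivalent characters, terminated by NOTACHAR (0xffffffff). A minimal sketch of
such a lookup in C follows. The table example_caseless_sets, its offsets, and
the helper name clist_contains are assumptions made for illustration; they are
not taken from the PCRE2 sources, and the real table's layout may differ.

```c
#include <stddef.h>
#include <stdint.h>

/* Terminator value, as stated in the document. */
#define NOTACHAR 0xffffffffu

/* Illustrative stand-in for ucd_caseless_sets (contents and offsets assumed).
   Each list holds three or more case-equivalent code points. */
static const uint32_t example_caseless_sets[] = {
  NOTACHAR,                          /* offset 0: empty list */
  0x004b, 0x006b, 0x212a, NOTACHAR,  /* offset 1: K, k, KELVIN SIGN */
  0x0053, 0x0073, 0x017f, NOTACHAR   /* offset 5: S, s, LONG S */
};

/* Hypothetical helper: return 1 if code point c appears in the list that
   starts at the given offset, scanning until the NOTACHAR terminator. */
static int clist_contains(const uint32_t *sets, size_t offset, uint32_t c)
{
  const uint32_t *p = sets + offset;
  while (*p != NOTACHAR) {
    if (*p == c) return 1;
    p++;
  }
  return 0;
}
```

With this stand-in table, clist_contains(example_caseless_sets, 1, 0x212a)
finds KELVIN SIGN in the list for "k", whereas looking up 0x0073 ("s") at the
same offset reaches the terminator without a match.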