From 88c15c7aa5ad4005ec8f9148283bc6a0917d5691 Mon Sep 17 00:00:00 2001
From: ph10
Date: Fri, 21 Apr 2017 16:30:18 +0000
Subject: [PATCH] Documentation update.

---
 HACKING | 110 +++++++++++++++++++++++++++++++++-----------------------
 1 file changed, 65 insertions(+), 45 deletions(-)

diff --git a/HACKING b/HACKING
index a314bfd..d727add 100644
--- a/HACKING
+++ b/HACKING
@@ -48,18 +48,20 @@ Friedl's terminology.
 OK, here's the real stuff
 -------------------------
 
-For the set of functions that formed the original PCRE1 library (which are
-unrelated to those mentioned above), I tried at first to invent an algorithm
-that used an amount of store bounded by a multiple of the number of characters
-in the pattern, to save on compiling time. However, because of the greater
-complexity in Perl regular expressions, I couldn't do this. In any case, a
-first pass through the pattern is helpful for other reasons.
+For the set of functions that formed the original PCRE1 library in 1997 (which
+are unrelated to those mentioned above), I tried at first to invent an
+algorithm that used an amount of store bounded by a multiple of the number of
+characters in the pattern, to save on compiling time. However, because of the
+greater complexity in Perl regular expressions, I couldn't do this, even though
+the then current Perl 5.004 patterns were much simpler than those supported
+nowadays. In any case, a first pass through the pattern is helpful for other
+reasons.
 
 
 Support for 16-bit and 32-bit data strings
 -------------------------------------------
 
-The library can be compiled in any combination of 8-bit, 16-bit or 32-bit
+The PCRE2 library can be compiled in any combination of 8-bit, 16-bit or 32-bit
 modes, creating up to three different libraries. In the description that
 follows, the word "short" is used for a 16-bit data quantity, and the phrase
 "code unit" is used for a quantity that is a byte in 8-bit mode, a short in
@@ -122,7 +124,7 @@ all the named subpatterns and their corresponding group numbers. This means
 that the actual compile (both the memory-computing dummy run and the real
 compile) has full knowledge of group names and numbers throughout. Several
 dozen lines of messy code were eliminated, though the new pre-pass was not
-short. In particular, parsing and skipping over [] classes was complicated.
+short. In particular, parsing and skipping over [] classes is complicated.
 
 While working on 10.22 I realized that I could simplify yet again by moving
 more of the parsing into the pre-pass, thus avoiding doing it in two places, so
@@ -149,8 +151,8 @@ of code units in the item itself. The exception is the aforementioned large
 advance to check for such values. When auto-callouts are enabled, the generous
 assumption is made that there will be a callout for each pattern code unit
 (which of course is only actually true if all code units are literals) plus one
-at the end. There is a default parsed pattern vector on the stack, but if this
-is not big enough, heap memory is used.
+at the end. There is a default parsed pattern vector on the system stack, but
+if this is not big enough, heap memory is used.
 
 As before, the actual compiling function is run twice, the first time to
 determine the amount of memory needed for the final compiled pattern. It
@@ -343,9 +345,14 @@ Changeable options
 ------------------
 
 The /i, /m, or /s options (PCRE2_CASELESS, PCRE2_MULTILINE, PCRE2_DOTALL, and
-some others) may change in the middle of patterns. Their processing is handled
-entirely at compile time by generating different opcodes for the different
-settings. The runtime functions do not need to keep track of an options state.
+others) may be changed in the middle of patterns by items such as (?i). Their
+processing is handled entirely at compile time by generating different opcodes
+for the different settings. The runtime functions do not need to keep track of
+an options state.
+
+PCRE2_DUPNAMES, PCRE2_EXTENDED, PCRE2_EXTENDED_MORE, and PCRE2_NO_AUTO_CAPTURE
+are tracked and processed during the parsing pre-pass. The others are handled
+from META_OPTIONS items during the main compile phase.
 
 
 Format of compiled patterns
@@ -437,14 +444,22 @@ Matching literal characters
 ---------------------------
 
 The OP_CHAR opcode is followed by a single character that is to be matched
-casefully. For caseless matching, OP_CHARI is used. In UTF-8 or UTF-16 modes,
-the character may be more than one code unit long. In UTF-32 mode, characters
-are always exactly one code unit long.
+casefully. For caseless matching of characters that have at most two
+case-equivalent code points, OP_CHARI is used. In UTF-8 or UTF-16 modes, the
+character may be more than one code unit long. In UTF-32 mode, characters are
+always exactly one code unit long.
 
 If there is only one character in a character class, OP_CHAR or OP_CHARI is
 used for a positive class, and OP_NOT or OP_NOTI for a negative one (that is,
 for something like [^a]).
 
+Caseless matching (positive or negative) of characters that have more than two
+case-equivalent code points (which is possible only in UTF mode) is handled by
+compiling a Unicode property item (see below), with the pseudo-property
+PT_CLIST. The value of this property is an offset in a vector called
+"ucd_caseless_sets" which identifies the start of a short list of equivalent
+characters, terminated by the value NOTACHAR (0xffffffff).
+
 
 Repeating single characters
 ---------------------------
@@ -520,7 +535,8 @@ Each is followed by two code units that encode the desired property as a type
 and a value. The types are a set of #defines of the form PT_xxx, and the values
 are enumerations of the form ucp_xx, defined in the pcre2_ucp.h source file.
 The value is relevant only for PT_GC (General Category), PT_PC (Particular
-Category), and PT_SC (Script).
+Category), PT_SC (Script), and the pseudo-property PT_CLIST, which is used to
+identify a list of case-equivalent characters when there are three or more.
 
 Repeats of these items use the OP_TYPESTAR etc. set of opcodes, followed by
 three code units: OP_PROP or OP_NOTPROP, and then the desired property type and
@@ -532,7 +548,10 @@ Character classes
 
 If there is only one character in a class, OP_CHAR or OP_CHARI is used for a
 positive class, and OP_NOT or OP_NOTI for a negative one (that is, for
-something like [^a]).
+something like [^a]), except when caselessly matching a character that has more
+than two case-equivalent code points (which can happen only in UTF mode). In
+this case a Unicode property item is used, as described above in "Matching
+literal characters".
 
 A set of repeating opcodes (called OP_NOTSTAR etc.) are used for repeated,
 negated, single-character classes. The normal single-character opcodes
@@ -553,8 +572,8 @@ do.
 For classes containing characters with values greater than 255 or that contain
 \p or \P, OP_XCLASS is used. It optionally uses a bit map if any acceptable code
 points are less than 256, followed by a list of pairs (for a range) and/or
-single characters and/or properties. In caseless mode, both cases are
-explicitly listed.
+single characters and/or properties. In caseless mode, all equivalent
+characters are explicitly listed.
 
 OP_XCLASS is followed by a LINK_SIZE value containing the total length of the
 opcode and its data. This is followed by a code unit containing flag bits:
@@ -611,8 +630,8 @@ opcode to see if it is one of these:
   OP_CRMINRANGE
   OP_CRPOSRANGE
 
-All but the last three are single-code-unit items, with no data. The others are
-followed by the minimum and maximum repeat counts.
+All but the last three are single-code-unit items, with no data. The range
+opcodes are followed by the minimum and maximum repeat counts.
 
 
 Brackets and alternation
@@ -627,16 +646,17 @@ myself, can be round, square, curly, or pointy. Hence this usage rather than
 
 Non-capturing brackets use the opcode OP_BRA, capturing brackets use OP_CBRA. A
 bracket opcode is followed by a LINK_SIZE value which gives the offset to the
-next alternative OP_ALT or, if there aren't any branches, to the matching
-OP_KET opcode. Each OP_ALT is followed by a LINK_SIZE value giving the offset
-to the next one, or to the OP_KET opcode. For capturing brackets, the bracket
-number is a count that immediately follows the offset.
+next alternative OP_ALT or, if there aren't any branches, to the terminating
+opcode. Each OP_ALT is followed by a LINK_SIZE value giving the offset to the
+next one, or to the final opcode. For capturing brackets, the bracket number is
+a count that immediately follows the offset.
 
-OP_KET is used for subpatterns that do not repeat indefinitely, and OP_KETRMIN
-and OP_KETRMAX are used for indefinite repetitions, minimally or maximally
-respectively (see below for possessive repetitions). All three are followed by
-a LINK_SIZE value giving (as a positive number) the offset back to the matching
-bracket opcode.
+There are several opcodes that mark the end of a subpattern group. OP_KET is
+used for subpatterns that do not repeat indefinitely, OP_KETRMIN and
+OP_KETRMAX are used for indefinite repetitions, minimally or maximally
+respectively, and OP_KETRPOS for possessive repetitions (see below for more
+details). All four are followed by a LINK_SIZE value giving (as a positive
+number) the offset back to the matching bracket opcode.
 
 If a subpattern is quantified such that it is permitted to match zero times, it
 is preceded by one of OP_BRAZERO, OP_BRAMINZERO, or OP_SKIPZERO. These are
@@ -725,11 +745,11 @@ tests the PCRE2 version number. This compiles into one of the opcodes OP_TRUE or
 OP_FALSE.
 
 If a condition is not a back reference, recursion test, DEFINE, or VERSION, it
-must start with an assertion, whose opcode normally immediately follows OP_COND
-or OP_SCOND. However, if automatic callouts are enabled, a callout is inserted
-immediately before the assertion. It is also possible to insert a manual
-callout at this point. Only assertion conditions may have callouts preceding
-the condition.
+must start with a parenthesized assertion, whose opcode normally immediately
+follows OP_COND or OP_SCOND. However, if automatic callouts are enabled, a
+callout is inserted immediately before the assertion. It is also possible to
+insert a manual callout at this point. Only assertion conditions may have
+callouts preceding the condition.
 
 A condition that is the negative assertion (?!) is optimized to OP_FAIL in all
 parts of the pattern, so this is another opcode that may appear as a condition.
@@ -758,12 +778,12 @@ treated as (?1)(?1)(?:(?1)){0,2}.
 Callouts
 --------
 
-A callout can nowadays have either a numerical argument or a string argument.
-These use OP_CALLOUT or OP_CALLOUT_STR, respectively. In each case these are
-followed by two LINK_SIZE values giving the offset in the pattern string to the
-start of the following item, and another count giving the length of this item.
-These values make it possible for pcre2test to output useful tracing
-information using callouts.
+A callout may have either a numerical argument or a string argument. These use
+OP_CALLOUT or OP_CALLOUT_STR, respectively. In each case these are followed by
+two LINK_SIZE values giving the offset in the pattern string to the start of
+the following item, and another count giving the length of this item. These
+values make it possible for pcre2test to output useful tracing information
+using callouts.
 
 In the case of a numeric callout, after these two values there is a single code
 unit containing the callout number, in the range 0-255, with 255 being used for
@@ -790,8 +810,8 @@ Opcode table checking
 ---------------------
 
 The last opcode that is defined in pcre2_internal.h is OP_TABLE_LENGTH. This is
-not a real opcode, but is used to check at compile time that tables indexed by
-opcode are the correct length, in order to catch updating errors.
+not a real opcode, but is used to check that tables indexed by opcode are the
+correct length, in order to catch updating errors.
 
 Philip Hazel
-17 March 2017
+21 April 2017
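
Note (not part of the patch): the PT_CLIST text added above describes an offset
into "ucd_caseless_sets" that selects a short list of case-equivalent
characters ending in NOTACHAR (0xffffffff). The following is a minimal sketch
of how such a list might be searched. It is not the PCRE2 source: the table
demo_caseless_sets and the function in_caseless_set are hypothetical names made
up for illustration, and only the NOTACHAR terminator value is taken from the
text. The two sample lists (K, k, KELVIN SIGN and S, s, LONG S) are well-known
three-way case equivalences in Unicode.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* List terminator; the value is the one quoted in the HACKING text. */
#define NOTACHAR 0xffffffffu

/* Hypothetical stand-in for the real table: two lists laid end to end,
   each terminated by NOTACHAR. The offsets 0 and 4 select the lists. */
static const uint32_t demo_caseless_sets[] = {
  0x004b, 0x006b, 0x212a, NOTACHAR,  /* K, k, KELVIN SIGN */
  0x0053, 0x0073, 0x017f, NOTACHAR   /* S, s, LATIN SMALL LETTER LONG S */
};

/* Return 1 if code point c is in the list that starts at "offset" within
   "sets", or 0 if the terminator is reached without finding it. */
static int in_caseless_set(const uint32_t *sets, size_t offset, uint32_t c)
{
  assert(sets != NULL);
  for (const uint32_t *p = sets + offset; *p != NOTACHAR; p++)
    if (*p == c) return 1;
  return 0;
}
```

With this sketch, in_caseless_set(demo_caseless_sets, 0, 0x212a) finds KELVIN
SIGN in the first list, while a lookup of 0x0073 at offset 0 fails because the
scan stops at the first terminator. Keeping all lists in one vector and storing
only an offset means the property value fits in a single code unit.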