diff --git a/Makefile.am b/Makefile.am
index 57ab2fc..5b62573 100644
--- a/Makefile.am
+++ b/Makefile.am
@@ -36,6 +36,11 @@ dist_html_DATA = \
  doc/html/pcre2matching.html \
  doc/html/pcre2partial.html \
  doc/html/pcre2pattern.html \
+ doc/html/pcre2perform.html \
+ doc/html/pcre2posix.html \
+ doc/html/pcre2sample.html \
+ doc/html/pcre2stack.html \
+ doc/html/pcre2syntax.html \
  doc/html/pcre2test.html \
  doc/html/pcre2unicode.html

@@ -66,12 +71,7 @@ dist_html_DATA = \
 # doc/html/pcre2_utf16_to_host_byte_order.html \
 # doc/html/pcre2_utf32_to_host_byte_order.html \
 # doc/html/pcre2_version.html \
-# doc/html/pcre2perform.html \
-# doc/html/pcre2posix.html \
-# doc/html/pcre2precompile.html \
-# doc/html/pcre2sample.html \
-# doc/html/pcre2stack.html \
-# doc/html/pcre2syntax.html
+# doc/html/pcre2precompile.html

 # FIXME
 dist_man_MANS = \
@@ -88,6 +88,11 @@ dist_man_MANS = \
  doc/pcre2matching.3 \
  doc/pcre2partial.3 \
  doc/pcre2pattern.3 \
+ doc/pcre2perform.3 \
+ doc/pcre2posix.3 \
+ doc/pcre2sample.3 \
+ doc/pcre2stack.3 \
+ doc/pcre2syntax.3 \
  doc/pcre2test.1 \
  doc/pcre2unicode.3

@@ -120,12 +125,7 @@ dist_man_MANS = \
 # doc/pcre2_utf16_to_host_byte_order.3 \
 # doc/pcre2_utf32_to_host_byte_order.3 \
 # doc/pcre2_version.3 \
-# doc/pcre2perform.3 \
-# doc/pcre2posix.3 \
-# doc/pcre2precompile.3 \
-# doc/pcre2sample.3 \
-# doc/pcre2stack.3 \
-# doc/pcre2syntax.3
+# doc/pcre2precompile.3

 # The Libtool libraries to install. We'll add to this later.
diff --git a/doc/html/pcre2perform.html b/doc/html/pcre2perform.html
new file mode 100644
index 0000000..0e182e7
--- /dev/null
+++ b/doc/html/pcre2perform.html
@@ -0,0 +1,196 @@
+
+
++Return to the PCRE2 index page. +
+
+This page is part of the PCRE2 HTML documentation. It was generated
+automatically from the original man page. If there is any nonsense in it,
+please consult the man page, in case the conversion went wrong.
+
+
+PCRE2 PERFORMANCE
+
+
+Two aspects of performance are discussed below: memory usage and processing +time. The way you express your pattern as a regular expression can affect both +of them. +
++Patterns are compiled by PCRE2 into a reasonably efficient interpretive code, +so that most simple patterns do not use much memory. However, there is one case +where the memory usage of a compiled pattern can be unexpectedly large. If a +parenthesized subpattern has a quantifier with a minimum greater than 1 and/or +a limited maximum, the whole subpattern is repeated in the compiled code. For +example, the pattern +
+ (abc|def){2,4} ++is compiled as if it were +
+ (abc|def)(abc|def)((abc|def)(abc|def)?)? ++(Technical aside: It is done this way so that backtrack points within each of +the repetitions can be independently maintained.) + +
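+The expansion described above can be sketched in C. This is only an
+illustration of the "as if" rewriting, not PCRE2's compiler; the helper names
+append() and unroll_quantifier() are hypothetical, and the group passed in is
+assumed to be already parenthesized.

```c
#include <stddef.h>
#include <string.h>

/* Append s to out (capacity out_size, current length *used).
   Returns -1 if it does not fit. */
static int append(char *out, size_t out_size, size_t *used, const char *s)
{
    size_t n = strlen(s);
    if (*used + n + 1 > out_size)
        return -1;
    memcpy(out + *used, s, n + 1);   /* copies the terminating zero too */
    *used += n;
    return 0;
}

/* Hypothetical helper: write the "as if" expansion of group{min,max}
   into out. Returns 0 on success, -1 on bad arguments or a buffer
   that is too small. */
int unroll_quantifier(const char *group, int min, int max,
                      char *out, size_t out_size)
{
    size_t used = 0;
    int optional = max - min;        /* copies that may or may not match */
    int i;

    if (min < 0 || max < min || out_size == 0)
        return -1;
    out[0] = '\0';

    /* The mandatory copies are simply repeated. */
    for (i = 0; i < min; i++)
        if (append(out, out_size, &used, group) != 0)
            return -1;

    /* The optional copies are nested: (G(G(G)?)?)? */
    for (i = 0; i < optional; i++) {
        if (i + 1 < optional && append(out, out_size, &used, "(") != 0)
            return -1;
        if (append(out, out_size, &used, group) != 0)
            return -1;
    }
    if (optional > 0 && append(out, out_size, &used, "?") != 0)
        return -1;
    for (i = 0; i + 1 < optional; i++)
        if (append(out, out_size, &used, ")?") != 0)
            return -1;
    return 0;
}
```

+Calling unroll_quantifier("(abc|def)", 2, 4, buf, sizeof buf) reproduces the
+expanded form shown above. The output length grows with the maximum count,
+which is why large limits make compiled patterns large.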
+For regular expressions whose quantifiers use only small numbers, this is not +usually a problem. However, if the numbers are large, and particularly if such +repetitions are nested, the memory usage can become an embarrassment. For +example, the very simple pattern +
+ ((ab){1,1000}c){1,3} ++uses 51K bytes when compiled using the 8-bit library. When PCRE2 is compiled +with its default internal pointer size of two bytes, the size limit on a +compiled pattern is 64K code units in the 8-bit and 16-bit libraries, and this +is reached with the above pattern if the outer repetition is increased from 3 +to 4. PCRE2 can be compiled to use larger internal pointers and thus handle +larger compiled patterns, but it is better to try to rewrite your pattern to +use less memory if you can. + +
+One way of reducing the memory usage for such patterns is to make use of +PCRE2's +"subroutine" +facility. Re-writing the above pattern as +
+ ((ab)(?2){0,999}c)(?1){0,2} ++reduces the memory requirements to 18K, and indeed it remains under 20K even +with the outer repetition increased to 100. However, this pattern is not +exactly equivalent, because the "subroutine" calls are treated as +atomic groups +into which there can be no backtracking if there is a subsequent matching +failure. Therefore, PCRE2 cannot do this kind of rewriting automatically. +Furthermore, there is a noticeable loss of speed when executing the modified +pattern. Nevertheless, if the atomic grouping is not a problem and the loss of +speed is acceptable, this kind of rewriting will allow you to process patterns +that PCRE2 cannot otherwise handle. + +
+When pcre2_match() is used for matching, certain kinds of pattern can +cause it to use large amounts of the process stack. In some environments the +default process stack is quite small, and if it runs out the result is often +SIGSEGV. Rewriting your pattern can often help. The +pcre2stack +documentation discusses this issue in detail. +
++Certain items in regular expression patterns are processed more efficiently +than others. It is more efficient to use a character class like [aeiou] than a +set of single-character alternatives such as (a|e|i|o|u). In general, the +simplest construction that provides the required behaviour is usually the most +efficient. Jeffrey Friedl's book contains a lot of useful general discussion +about optimizing regular expressions for efficient performance. This document +contains a few observations about PCRE2. +
++Using Unicode character properties (the \p, \P, and \X escapes) is slow, +because PCRE2 has to use a multi-stage table lookup whenever it needs a +character's property. If you can find an alternative pattern that does not use +character properties, it will probably be faster. +
++By default, the escape sequences \b, \d, \s, and \w, and the POSIX +character classes such as [:alpha:] do not use Unicode properties, partly for +backwards compatibility, and partly for performance reasons. However, you can +set the PCRE2_UCP option or start the pattern with (*UCP) if you want Unicode +character properties to be used. This can double the matching time for items +such as \d, when matched with pcre2_match(); the performance loss is +less with a DFA matching function, and in both cases there is not much +difference for \b. +
++When a pattern begins with .* not in parentheses, or in parentheses that are +not the subject of a backreference, and the PCRE2_DOTALL option is set, the +pattern is implicitly anchored by PCRE2, since it can match only at the start +of a subject string. However, if PCRE2_DOTALL is not set, PCRE2 cannot make +this optimization, because the dot metacharacter does not then match a newline, +and if the subject string contains newlines, the pattern may match from the +character immediately following one of them instead of from the very start. For +example, the pattern +
+ .*second ++matches the subject "first\nand second" (where \n stands for a newline +character), with the match starting at the seventh character. In order to do +this, PCRE2 has to retry the match starting after every newline in the subject. + +
+If you are using such a pattern with subject strings that do not contain +newlines, the best performance is obtained by setting PCRE2_DOTALL, or starting +the pattern with ^.* or ^.*? to indicate explicit anchoring. That saves PCRE2 +from having to scan along the subject looking for a newline to restart at. +
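+The restart behaviour can be modelled with a short C sketch. It is not PCRE2
+code: the hypothetical function dotstar_match_start() merely emulates where a
+pattern of the form .*literal would start matching when PCRE2_DOTALL is not
+set, assuming the literal itself contains no newline.

```c
#include <stddef.h>
#include <string.h>

/* Model of ".*literal" without DOTALL: the match can begin only at the
   start of the subject or just after a newline, so each line is tried
   in turn. Returns the start offset of the match, or -1 if none. */
long dotstar_match_start(const char *subject, const char *literal)
{
    size_t lit_len = strlen(literal);
    const char *line = subject;

    for (;;) {
        const char *nl = strchr(line, '\n');
        size_t line_len = nl ? (size_t)(nl - line) : strlen(line);
        size_t i;

        /* Look for the literal inside the current line only. */
        for (i = 0; i + lit_len <= line_len; i++)
            if (memcmp(line + i, literal, lit_len) == 0)
                return (long)(line - subject);

        if (nl == NULL)
            return -1;               /* no more lines to restart at */
        line = nl + 1;               /* restart after the newline */
    }
}
```

+For the subject "first\nand second" and the literal "second", the function
+returns offset 6, which is the seventh character, as in the example above.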
++Beware of patterns that contain nested indefinite repeats. These can take a +long time to run when applied to a string that does not match. Consider the +pattern fragment +
+ ^(a+)* ++This can match "aaaa" in 16 different ways, and this number increases very +rapidly as the string gets longer. (The * repeat can match 0, 1, 2, 3, or 4 +times, and for each of those cases other than 0 or 4, the + repeats can match +different numbers of times.) When the remainder of the pattern is such that the +entire match is going to fail, PCRE2 has in principle to try every possible +variation, and this can take an extremely long time, even for relatively short +strings. + +
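+The figure of 16 can be checked with a small counting sketch in C. It is a
+model of the search space only, not of PCRE2's matcher, and the function names
+are hypothetical. It counts every way in which the * repeat can split a prefix
+of n "a" characters into non-empty runs matched by a+.

```c
/* Number of ways to split exactly k characters into an ordered
   sequence of non-empty runs (one run per iteration of the group). */
static unsigned long splits(unsigned int k)
{
    unsigned long total = 0;
    unsigned int first;

    if (k == 0)
        return 1;                    /* zero iterations of the group */
    for (first = 1; first <= k; first++)
        total += splits(k - first);  /* choose the length of the first run */
    return total;
}

/* Number of distinct ways ^(a+)* can match some prefix of n 'a's. */
unsigned long count_match_paths(unsigned int n)
{
    unsigned long total = 0;
    unsigned int k;

    for (k = 0; k <= n; k++)
        total += splits(k);
    return total;
}
```

+count_match_paths(4) gives 16, and each extra character doubles the result,
+which is why a failing match can take so long on even a short string.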
+An optimization catches some of the simpler cases such as +
+ (a+)*b ++where a literal character follows. Before embarking on the standard matching +procedure, PCRE2 checks that there is a "b" later in the subject string, and if +there is not, it fails the match immediately. However, when there is no +following literal this optimization cannot be used. You can see the difference +by comparing the behaviour of +
+ (a+)*\d ++with the pattern above. The pattern (a+)*b gives a failure almost instantly when +applied to a whole line of "a" characters, whereas (a+)*\d takes an +appreciable time with strings longer than about 20 characters. + +
+In many cases, the solution to this kind of performance issue is to use an +atomic group or a possessive quantifier. +
+
+Philip Hazel
+
+University Computing Service
+
+Cambridge CB2 3QH, England.
+
+
+Last updated: 20 October 2014
+
+Copyright © 1997-2014 University of Cambridge.
+
+
+Return to the PCRE2 index page. +
diff --git a/doc/html/pcre2posix.html b/doc/html/pcre2posix.html
new file mode 100644
index 0000000..93829b3
--- /dev/null
+++ b/doc/html/pcre2posix.html
@@ -0,0 +1,292 @@
+
+
+
+Return to the PCRE2 index page.
+
+
+This page is part of the PCRE2 HTML documentation. It was generated
+automatically from the original man page. If there is any nonsense in it,
+please consult the man page, in case the conversion went wrong.
+
+
+#include <pcre2posix.h> +
+
+int regcomp(regex_t *preg, const char *pattern,
+ int cflags);
+
+
+int regexec(const regex_t *preg, const char *string,
+ size_t nmatch, regmatch_t pmatch[], int eflags);
+
+
+size_t regerror(int errcode, const regex_t *preg,
+ char *errbuf, size_t errbuf_size);
+
+
+void regfree(regex_t *preg);
+
+This set of functions provides a POSIX-style API for the PCRE2 regular +expression 8-bit library. See the +pcre2api +documentation for a description of PCRE2's native API, which contains much +additional functionality. There is no POSIX-style wrapper for PCRE2's 16-bit +and 32-bit libraries. +
++The functions described here are just wrapper functions that ultimately call +the PCRE2 native API. Their prototypes are defined in the pcre2posix.h +header file, and on Unix systems the library itself is called +libpcre2-posix.a, so can be accessed by adding -lpcre2-posix to the +command for linking an application that uses them. Because the POSIX functions +call the native ones, it is also necessary to add -lpcre2-8. +
++Those POSIX option bits that can reasonably be mapped to PCRE2 native options +have been implemented. In addition, the option REG_EXTENDED is defined with the +value zero. This has no effect, but since programs that are written to the +POSIX interface often use it, this makes it easier to slot in PCRE2 as a +replacement library. Other POSIX options are not even defined. +
++There are also some other options that are not defined by POSIX. These have +been added at the request of users who want to make use of certain +PCRE2-specific features via the POSIX calling interface. +
++When PCRE2 is called via these functions, it is only the API that is POSIX-like +in style. The syntax and semantics of the regular expressions themselves are +still those of Perl, subject to the setting of various PCRE2 options, as +described below. "POSIX-like in style" means that the API approximates to the +POSIX definition; it is not fully POSIX-compatible, and in multi-unit encoding +domains it is probably even less compatible. +
++The header for these functions is supplied as pcre2posix.h to avoid any +potential clash with other POSIX libraries. It can, of course, be renamed or +aliased as regex.h, which is the "correct" name. It provides two +structure types, regex_t for compiled internal forms, and +regmatch_t for returning captured substrings. It also defines some +constants whose names start with "REG_"; these are used for setting options and +identifying error codes. +
++The function regcomp() is called to compile a pattern into an +internal form. The pattern is a C string terminated by a binary zero, and +is passed in the argument pattern. The preg argument is a pointer +to a regex_t structure that is used as a base for storing information +about the compiled regular expression. +
++The argument cflags is either zero, or contains one or more of the bits +defined by the following macros: +
+ REG_DOTALL ++The PCRE2_DOTALL option is set when the regular expression is passed for +compilation to the native function. Note that REG_DOTALL is not part of the +POSIX standard. +
+ REG_ICASE ++The PCRE2_CASELESS option is set when the regular expression is passed for +compilation to the native function. +
+ REG_NEWLINE ++The PCRE2_MULTILINE option is set when the regular expression is passed for +compilation to the native function. Note that this does not mimic the +defined POSIX behaviour for REG_NEWLINE (see the following section). +
+ REG_NOSUB ++The PCRE2_NO_AUTO_CAPTURE option is set when the regular expression is passed +for compilation to the native function. In addition, when a pattern that is +compiled with this flag is passed to regexec() for matching, the +nmatch and pmatch arguments are ignored, and no captured strings +are returned. +
+ REG_UCP ++The PCRE2_UCP option is set when the regular expression is passed for +compilation to the native function. This causes PCRE2 to use Unicode properties +when matching \d, \w, etc., instead of just recognizing ASCII values. Note +that REG_UCP is not part of the POSIX standard. +
+ REG_UNGREEDY ++The PCRE2_UNGREEDY option is set when the regular expression is passed for +compilation to the native function. Note that REG_UNGREEDY is not part of the +POSIX standard. +
+ REG_UTF ++The PCRE2_UTF option is set when the regular expression is passed for +compilation to the native function. This causes the pattern itself and all data +strings used for matching it to be treated as UTF-8 strings. Note that REG_UTF +is not part of the POSIX standard. + +
+In the absence of these flags, no options are passed to the native function. +This means that the regex is compiled with PCRE2 default semantics. In +particular, the way it handles newline characters in the subject string is the +Perl way, not the POSIX way. Note that setting PCRE2_MULTILINE has only +some of the effects specified for REG_NEWLINE. It does not affect the way +newlines are matched by the dot metacharacter (they are not) or by a negative +class such as [^a] (they are). +
++The yield of regcomp() is zero on success, and non-zero otherwise. The +preg structure is filled in on success, and one member of the structure +is public: re_nsub contains the number of capturing subpatterns in +the regular expression. Various error codes are defined in the header file. +
++NOTE: If the yield of regcomp() is non-zero, you must not attempt to +use the contents of the preg structure. If, for example, you pass it to +regexec(), the result is undefined and your program is likely to crash. +
++This area is not simple, because POSIX and Perl take different views of things. +It is not possible to get PCRE2 to obey POSIX semantics, but then PCRE2 was +never intended to be a POSIX engine. The following table lists the different +possibilities for matching newline characters in PCRE2: +
+                          Default   Change with
+
+  . matches newline          no     PCRE2_DOTALL
+  newline matches [^a]       yes    not changeable
+  $ matches \n at end        yes    PCRE2_DOLLAR_ENDONLY
+  $ matches \n in middle     no     PCRE2_MULTILINE
+  ^ matches \n in middle     no     PCRE2_MULTILINE
+
+This is the equivalent table for POSIX:
+
+                          Default   Change with
+
+  . matches newline          yes    REG_NEWLINE
+  newline matches [^a]       yes    REG_NEWLINE
+  $ matches \n at end        no     REG_NEWLINE
+  $ matches \n in middle     no     REG_NEWLINE
+  ^ matches \n in middle     no     REG_NEWLINE
+
+PCRE2's behaviour is the same as Perl's, except that there is no equivalent for
+PCRE2_DOLLAR_ENDONLY in Perl. In both PCRE2 and Perl, there is no way to stop
+newline from matching [^a].
+
+
+The default POSIX newline handling can be obtained by setting PCRE2_DOTALL and +PCRE2_DOLLAR_ENDONLY, but there is no way to make PCRE2 behave exactly as for +the REG_NEWLINE action. +
++The function regexec() is called to match a compiled pattern preg +against a given string, which is by default terminated by a zero byte +(but see REG_STARTEND below), subject to the options in eflags. These can +be: +
+ REG_NOTBOL ++The PCRE2_NOTBOL option is set when calling the underlying PCRE2 matching +function. +
+ REG_NOTEMPTY ++The PCRE2_NOTEMPTY option is set when calling the underlying PCRE2 matching +function. Note that REG_NOTEMPTY is not part of the POSIX standard. However, +setting this option can give more POSIX-like behaviour in some situations. +
+ REG_NOTEOL ++The PCRE2_NOTEOL option is set when calling the underlying PCRE2 matching +function. +
+ REG_STARTEND ++The string is considered to start at string + pmatch[0].rm_so and +to have a terminating NUL located at string + pmatch[0].rm_eo +(there need not actually be a NUL at that location), regardless of the value of +nmatch. This is a BSD extension, compatible with but not specified by +IEEE Standard 1003.2 (POSIX.2), and should be used with caution in software +intended to be portable to other systems. Note that a non-zero rm_so does +not imply REG_NOTBOL; REG_STARTEND affects only the location of the string, not +how it is matched. + +
+If the pattern was compiled with the REG_NOSUB flag, no data about any matched +strings is returned. The nmatch and pmatch arguments of +regexec() are ignored. +
++If the value of nmatch is zero, or if the value of pmatch is NULL, +no data about any matched strings is returned. +
++Otherwise, the portion of the string that was matched, and also any captured +substrings, are returned via the pmatch argument, which points to an +array of nmatch structures of type regmatch_t, containing the +members rm_so and rm_eo. These contain the byte offset to the first +character of each substring and the offset to the first character after the end +of each substring, respectively. The 0th element of the vector relates to the +entire portion of string that was matched; subsequent elements relate to +the capturing subpatterns of the regular expression. Unused entries in the +array have both structure members set to -1. +
++A successful match yields a zero return; various error codes are defined in the +header file, of which REG_NOMATCH is the "expected" failure code. +
++The regerror() function maps a non-zero error code from either +regcomp() or regexec() to a printable message. If preg is not +NULL, the error should have arisen from the use of that structure. A message +terminated by a binary zero is placed in errbuf. The length of the +message, including the zero, is limited to errbuf_size. The yield of the +function is the size of buffer needed to hold the whole message. +
++Compiling a regular expression causes memory to be allocated and associated +with the preg structure. The function regfree() frees all such +memory, after which preg may no longer be used as a compiled expression. +
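+The four functions fit together as in the following minimal sketch. The helper
+name find_first() is hypothetical. As an assumption made so that the sketch
+builds without PCRE2 installed, it falls back to the system regex.h unless
+USE_PCRE2_POSIX is defined; with that macro defined, link with -lpcre2-posix
+and -lpcre2-8 as described earlier.

```c
#include <stdio.h>
#ifdef USE_PCRE2_POSIX
#include <pcre2posix.h>   /* the PCRE2 wrapper described in this document */
#else
#include <regex.h>        /* assumption: fall back to the system POSIX regex */
#endif

/* Hypothetical helper: find the first match of pattern in subject.
   Returns 0 and sets *start and *end (byte offsets) on a match,
   REG_NOMATCH if there is no match, or another non-zero code on error. */
int find_first(const char *pattern, const char *subject, int *start, int *end)
{
    regex_t re;
    regmatch_t pmatch[1];
    char errbuf[128];
    int rc;

    /* REG_EXTENDED is zero in pcre2posix.h, so it is harmless there. */
    rc = regcomp(&re, pattern, REG_EXTENDED);
    if (rc != 0) {
        /* After a failed regcomp() the structure is passed only to regerror(). */
        regerror(rc, &re, errbuf, sizeof errbuf);
        fprintf(stderr, "regcomp failed: %s\n", errbuf);
        return rc;
    }

    rc = regexec(&re, subject, 1, pmatch, 0);
    if (rc == 0) {
        *start = (int)pmatch[0].rm_so;   /* offset of first matched byte */
        *end = (int)pmatch[0].rm_eo;     /* offset just past the match */
    } else if (rc != REG_NOMATCH) {
        regerror(rc, &re, errbuf, sizeof errbuf);
        fprintf(stderr, "regexec failed: %s\n", errbuf);
    }

    regfree(&re);                        /* release memory held by re */
    return rc;
}
```

+For the pattern "cat|dog" and the subject "the cat sat on the mat", the helper
+returns zero with offsets 4 and 7. Only element 0 of pmatch is requested, so no
+captured substrings are reported.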
+
+Philip Hazel
+
+University Computing Service
+
+Cambridge CB2 3QH, England.
+
+
+Last updated: 20 October 2014
+
+Copyright © 1997-2014 University of Cambridge.
+
+
+Return to the PCRE2 index page. +
diff --git a/doc/html/pcre2sample.html b/doc/html/pcre2sample.html
new file mode 100644
index 0000000..1cba300
--- /dev/null
+++ b/doc/html/pcre2sample.html
@@ -0,0 +1,106 @@
+
+
+
+Return to the PCRE2 index page.
+
+
+This page is part of the PCRE2 HTML documentation. It was generated
+automatically from the original man page. If there is any nonsense in it,
+please consult the man page, in case the conversion went wrong.
+
+
+PCRE2 SAMPLE PROGRAM
+
+
+A simple, complete demonstration program to get you started with using PCRE2 is +supplied in the file pcre2demo.c in the src directory in the PCRE2 +distribution. A listing of this program is given in the +pcre2demo +documentation. If you do not have a copy of the PCRE2 distribution, you can +save this listing to re-create the contents of pcre2demo.c. +
++The demonstration program, which uses the PCRE2 8-bit library, compiles the +regular expression that is its first argument, and matches it against the +subject string in its second argument. No PCRE2 options are set, and default +character tables are used. If matching succeeds, the program outputs the +portion of the subject that matched, together with the contents of any captured +substrings. +
++If the -g option is given on the command line, the program then goes on to +check for further matches of the same regular expression in the same subject +string. The logic is a little bit tricky because of the possibility of matching +an empty string. Comments in the code explain what is going on. +
++If PCRE2 is installed in the standard include and library directories for your +operating system, you should be able to compile the demonstration program using +this command: +
+ gcc -o pcre2demo pcre2demo.c -lpcre2-8 ++If PCRE2 is installed elsewhere, you may need to add additional options to the +command line. For example, on a Unix-like system that has PCRE2 installed in +/usr/local, you can compile the demonstration program using a command +like this: +
+ gcc -o pcre2demo -I/usr/local/include pcre2demo.c -L/usr/local/lib -lpcre2-8 + ++ +
+Once you have compiled and linked the demonstration program, you can run simple +tests like this: +
+ ./pcre2demo 'cat|dog' 'the cat sat on the mat' + ./pcre2demo -g 'cat|dog' 'the dog sat on the cat' ++Note that there is a much more comprehensive test program, called +pcre2test, +which supports many more facilities for testing regular expressions using the +PCRE2 libraries. The +pcre2demo +program is provided as a simple coding example. + +
+If you try to run +pcre2demo +when PCRE2 is not installed in the standard library directory, you may get an +error like this on some operating systems (e.g. Solaris): +
+ ld.so.1: a.out: fatal: libpcre2.so.0: open failed: No such file or directory ++This is caused by the way shared library support works on those systems. You +need to add +
+ -R/usr/local/lib ++(for example) to the compile command to get round this problem. + +
+Philip Hazel
+
+University Computing Service
+
+Cambridge CB2 3QH, England.
+
+
+Last updated: 20 October 2014
+
+Copyright © 1997-2014 University of Cambridge.
+
+
+Return to the PCRE2 index page. +
diff --git a/doc/html/pcre2stack.html b/doc/html/pcre2stack.html
new file mode 100644
index 0000000..aff072f
--- /dev/null
+++ b/doc/html/pcre2stack.html
@@ -0,0 +1,203 @@
+
+
+
+Return to the PCRE2 index page.
+
+
+This page is part of the PCRE2 HTML documentation. It was generated
+automatically from the original man page. If there is any nonsense in it,
+please consult the man page, in case the conversion went wrong.
+
+
+PCRE2 DISCUSSION OF STACK USAGE
+
+
+When you call pcre2_match(), it makes use of an internal function called +match(). This calls itself recursively at branch points in the pattern, +in order to remember the state of the match so that it can back up and try a +different alternative after a failure. As matching proceeds deeper and deeper +into the tree of possibilities, the recursion depth increases. The +match() function is also called in other circumstances, for example, +whenever a parenthesized sub-pattern is entered, and in certain cases of +repetition. +
++Not all calls of match() increase the recursion depth; for an item such +as a* it may be called several times at the same level, after matching +different numbers of a's. Furthermore, in a number of cases where the result of +the recursive call would immediately be passed back as the result of the +current call (a "tail recursion"), the function is just restarted instead. +
++The above comments apply when pcre2_match() is run in its normal +interpretive manner. If the compiled pattern was processed by +pcre2_jit_compile(), and just-in-time compiling was successful, and the +options passed to pcre2_match() were not incompatible, the matching +process uses the JIT-compiled code instead of the match() function. In +this case, the memory requirements are handled entirely differently. See the +pcre2jit +documentation for details. +
++The pcre2_dfa_match() function operates in a different way to +pcre2_match(), and uses recursion only when there is a regular expression +recursion or subroutine call in the pattern. This includes the processing of +assertion and "once-only" subpatterns, which are handled like subroutine calls. +Normally, these are never very deep, and the limit on the complexity of +pcre2_dfa_match() is controlled by the amount of workspace it is given. +However, it is possible to write patterns with runaway infinite recursions; +such patterns will cause pcre2_dfa_match() to run out of stack. At +present, there is no protection against this. +
++The comments that follow do NOT apply to pcre2_dfa_match(); they are +relevant only for pcre2_match() without the JIT optimization. +
++Each time that the internal match() function is called recursively, it +uses memory from the process stack. For certain kinds of pattern and data, very +large amounts of stack may be needed, despite the recognition of "tail +recursion". You can often reduce the amount of recursion, and therefore the +amount of stack used, by modifying the pattern that is being matched. Consider, +for example, this pattern: +
+ ([^<]|<(?!inet))+ ++It matches from wherever it starts until it encounters "<inet" or the end of +the data, and is the kind of pattern that might be used when processing an XML +file. Each iteration of the outer parentheses matches either one character that +is not "<" or a "<" that is not followed by "inet". However, each time a +parenthesis is processed, a recursion occurs, so this formulation uses a stack +frame for each matched character. For a long string, a lot of stack is +required. Consider now this rewritten pattern, which matches exactly the same +strings: +
+ ([^<]++|<(?!inet))+ ++This uses very much less stack, because runs of characters that do not contain +"<" are "swallowed" in one item inside the parentheses. Recursion happens only +when a "<" character that is not followed by "inet" is encountered (and we +assume this is relatively rare). A possessive quantifier is used to stop any +backtracking into the runs of non-"<" characters, but that is not related to +stack usage. + +
+This example shows that one way of avoiding stack problems when matching long +subject strings is to write repeated parenthesized subpatterns to match more +than one character whenever possible. +
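+The difference between the two formulations can be modelled by counting
+iterations of the outer group, since each iteration costs one recursion. The
+following C sketch is only a counting model, not PCRE2 code, and both function
+names are hypothetical.

```c
#include <stddef.h>
#include <string.h>

/* Iterations of the group in ([^<]|<(?!inet))+ : one per character. */
size_t iterations_per_char(const char *s)
{
    size_t n = 0;

    while (*s != '\0') {
        if (*s == '<' && strncmp(s + 1, "inet", 4) == 0)
            break;                   /* "<inet" ends the match */
        s++;
        n++;
    }
    return n;
}

/* Iterations of the group in ([^<]++|<(?!inet))+ : one per run of
   non-"<" characters, plus one per "<" that is not followed by "inet". */
size_t iterations_per_run(const char *s)
{
    size_t n = 0;

    while (*s != '\0') {
        if (*s == '<') {
            if (strncmp(s + 1, "inet", 4) == 0)
                break;               /* "<inet" ends the match */
            s++;                     /* a lone "<" is one iteration */
        } else {
            s += strcspn(s, "<");    /* swallow the whole run at once */
        }
        n++;
    }
    return n;
}
```

+For the subject "abc<b>def<inet" the first function returns 9 and the second
+returns 3. On long runs of text with few "<" characters the gap becomes much
+larger.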
++In environments where stack memory is constrained, you might want to compile +PCRE2 to use heap memory instead of stack for remembering back-up points when +pcre2_match() is running. This makes it run more slowly, however. Details +of how to do this are given in the +pcre2build +documentation. When built in this way, instead of using the stack, PCRE2 +gets memory for remembering backup points from the heap. By default, the memory +is obtained by calling the system malloc() function, but you can arrange +to supply your own memory management function. For details, see the section +entitled +"The match context" +in the +pcre2api +documentation. Since the block sizes are always the same, it may be possible to +implement a customized memory handler that is more efficient than the standard +function. The memory blocks obtained for this purpose are retained and re-used +if possible while pcre2_match() is running. They are all freed just +before it exits. +
++You can set limits on the number of times the internal match() function +is called, both in total and recursively. If a limit is exceeded, +pcre2_match() returns an error code. Setting suitable limits should +prevent it from running out of stack. The default values of the limits are very +large, and unlikely ever to operate. They can be changed when PCRE2 is built, +and they can also be set when pcre2_match() is called. For details of +these interfaces, see the +pcre2build +documentation and the section entitled +"The match context" +in the +pcre2api +documentation. +
++As a very rough rule of thumb, you should reckon on about 500 bytes per +recursion. Thus, if you want to limit your stack usage to 8Mb, you should set +the limit at 16000 recursions. A 64Mb stack, on the other hand, can support +around 128000 recursions. +
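+The arithmetic behind this rule of thumb is a single division, sketched here in
+C. The function name and the macro are hypothetical, and the 500-byte figure is
+the rough estimate quoted above, not a measured value.

```c
/* Rough estimate from the text above: bytes of stack per recursion. */
#define ASSUMED_BYTES_PER_RECURSION 500UL

/* Hypothetical helper: recursion limit that keeps stack usage within
   stack_bytes, under the rule of thumb. */
unsigned long recursion_limit_for_stack(unsigned long stack_bytes)
{
    return stack_bytes / ASSUMED_BYTES_PER_RECURSION;
}
```

+The figures quoted above correspond to reading 8Mb as 8,000,000 bytes, which
+gives exactly 16000, and 64Mb as 64,000,000 bytes, which gives 128000. With
+binary megabytes (8*1024*1024 bytes) the result is 16777, so treat any such
+value as an approximate starting point.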
++The pcre2test test program has a modifier called "find_limits" which, if +applied to a subject line, causes it to find the smallest limits that allow a +pattern to match. This is done by calling pcre2_match() repeatedly with +different limits. +
++In Unix-like environments, there is not often a problem with the stack unless +very long strings are involved, though the default limit on stack size varies +from system to system. Values from 8Mb to 64Mb are common. You can find your +default limit by running the command: +
+ ulimit -s ++Unfortunately, the effect of running out of stack is often SIGSEGV, though +sometimes a more explicit error message is given. You can normally increase the +limit on stack size by code such as this: +
+  struct rlimit rlim;
+  getrlimit(RLIMIT_STACK, &rlim);
+  rlim.rlim_cur = 100*1024*1024;
+  setrlimit(RLIMIT_STACK, &rlim);
+
+This reads the current limits (soft and hard) using getrlimit(), then
+attempts to increase the soft limit to 100Mb using setrlimit(). You must
+do this before calling pcre2_match().
+
+
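+A slightly more defensive variant of the same idea is sketched below. The
+helper name raise_stack_soft_limit() is hypothetical. It checks the return
+values, leaves the limit alone if it is already large enough, and clamps the
+request to the hard limit, which an unprivileged process cannot exceed.

```c
#include <sys/resource.h>

/* Hypothetical helper: try to raise the soft stack limit to at least
   wanted bytes. Returns 0 on success (including when nothing needed
   to change) and -1 if a system call failed. */
int raise_stack_soft_limit(rlim_t wanted)
{
    struct rlimit rlim;

    if (getrlimit(RLIMIT_STACK, &rlim) != 0)
        return -1;

    /* Nothing to do if the soft limit is already large enough. */
    if (rlim.rlim_cur == RLIM_INFINITY || rlim.rlim_cur >= wanted)
        return 0;

    /* The soft limit cannot be raised above the hard limit. */
    if (rlim.rlim_max != RLIM_INFINITY && wanted > rlim.rlim_max)
        wanted = rlim.rlim_max;

    rlim.rlim_cur = wanted;
    return setrlimit(RLIMIT_STACK, &rlim) == 0 ? 0 : -1;
}
```

+A call such as raise_stack_soft_limit((rlim_t)100 * 1024 * 1024) would be made
+once at program start, before any call to pcre2_match(). Because of the
+clamping, a zero return does not guarantee that the full amount was granted.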
+Using setrlimit(), as described above, should also work on Mac OS X. It +is also possible to set a stack size when linking a program. There is a +discussion about stack sizes in Mac OS X at this web site: +http://developer.apple.com/qa/qa2005/qa1419.html. +
+
+Philip Hazel
+
+University Computing Service
+
+Cambridge CB2 3QH, England.
+
+
+Last updated: 20 October 2014
+
+Copyright © 1997-2014 University of Cambridge.
+
+
+Return to the PCRE2 index page. +
diff --git a/doc/html/pcre2syntax.html b/doc/html/pcre2syntax.html
new file mode 100644
index 0000000..c240ca9
--- /dev/null
+++ b/doc/html/pcre2syntax.html
@@ -0,0 +1,561 @@
+
+
+
+Return to the PCRE2 index page.
+
+
+This page is part of the PCRE2 HTML documentation. It was generated
+automatically from the original man page. If there is any nonsense in it,
+please consult the man page, in case the conversion went wrong.
+
+
+The full syntax and semantics of the regular expressions that are supported by +PCRE2 are described in the +pcre2pattern +documentation. This document contains a quick-reference summary of the syntax. +
++
+  \x         where x is non-alphanumeric is a literal x
+  \Q...\E    treat enclosed characters as literal
+
+
+
+  \a         alarm, that is, the BEL character (hex 07)
+  \cx        "control-x", where x is any ASCII character
+  \e         escape (hex 1B)
+  \f         form feed (hex 0C)
+  \n         newline (hex 0A)
+  \r         carriage return (hex 0D)
+  \t         tab (hex 09)
+  \0dd       character with octal code 0dd
+  \ddd       character with octal code ddd, or backreference
+  \o{ddd..}  character with octal code ddd..
+  \xhh       character with hex code hh
+  \x{hhh..}  character with hex code hhh..
+
+Note that \0dd is always an octal code, and that \8 and \9 are the literal
+characters "8" and "9".
+
+
+
+  .          any character except newline;
+               in dotall mode, any character whatsoever
+  \C         one data unit, even in UTF mode (best avoided)
+  \d         a decimal digit
+  \D         a character that is not a decimal digit
+  \h         a horizontal white space character
+  \H         a character that is not a horizontal white space character
+  \N         a character that is not a newline
+  \p{xx}     a character with the xx property
+  \P{xx}     a character without the xx property
+  \R         a newline sequence
+  \s         a white space character
+  \S         a character that is not a white space character
+  \v         a vertical white space character
+  \V         a character that is not a vertical white space character
+  \w         a "word" character
+  \W         a "non-word" character
+  \X         a Unicode extended grapheme cluster
+
+By default, \d, \s, and \w match only ASCII characters, even in UTF-8 mode
+or in the 16-bit and 32-bit libraries. However, if locale-specific matching is
+happening, \s and \w may also match characters with code points in the range
+128-255. If the PCRE2_UCP option is set, the behaviour of these escape
+sequences is changed to use Unicode properties and they match many more
+characters.
+
+
+
+ C Other + Cc Control + Cf Format + Cn Unassigned + Co Private use + Cs Surrogate + + L Letter + Ll Lower case letter + Lm Modifier letter + Lo Other letter + Lt Title case letter + Lu Upper case letter + L& Ll, Lu, or Lt + + M Mark + Mc Spacing mark + Me Enclosing mark + Mn Non-spacing mark + + N Number + Nd Decimal number + Nl Letter number + No Other number + + P Punctuation + Pc Connector punctuation + Pd Dash punctuation + Pe Close punctuation + Pf Final punctuation + Pi Initial punctuation + Po Other punctuation + Ps Open punctuation + + S Symbol + Sc Currency symbol + Sk Modifier symbol + Sm Mathematical symbol + So Other symbol + + Z Separator + Zl Line separator + Zp Paragraph separator + Zs Space separator ++ +
+
+ Xan Alphanumeric: union of properties L and N + Xps POSIX space: property Z or tab, NL, VT, FF, CR + Xsp Perl space: property Z or tab, NL, VT, FF, CR + Xuc Universally-named character: one that can be + represented by a Universal Character Name + Xwd Perl word: property Xan or underscore ++Perl and POSIX space are now the same. Perl added VT to its space character set +at release 5.18. + +
+Arabic, +Armenian, +Avestan, +Balinese, +Bamum, +Bassa_Vah, +Batak, +Bengali, +Bopomofo, +Brahmi, +Braille, +Buginese, +Buhid, +Canadian_Aboriginal, +Carian, +Caucasian_Albanian, +Chakma, +Cham, +Cherokee, +Common, +Coptic, +Cuneiform, +Cypriot, +Cyrillic, +Deseret, +Devanagari, +Duployan, +Egyptian_Hieroglyphs, +Elbasan, +Ethiopic, +Georgian, +Glagolitic, +Gothic, +Grantha, +Greek, +Gujarati, +Gurmukhi, +Han, +Hangul, +Hanunoo, +Hebrew, +Hiragana, +Imperial_Aramaic, +Inherited, +Inscriptional_Pahlavi, +Inscriptional_Parthian, +Javanese, +Kaithi, +Kannada, +Katakana, +Kayah_Li, +Kharoshthi, +Khmer, +Khojki, +Khudawadi, +Lao, +Latin, +Lepcha, +Limbu, +Linear_A, +Linear_B, +Lisu, +Lycian, +Lydian, +Mahajani, +Malayalam, +Mandaic, +Manichaean, +Meetei_Mayek, +Mende_Kikakui, +Meroitic_Cursive, +Meroitic_Hieroglyphs, +Miao, +Modi, +Mongolian, +Mro, +Myanmar, +Nabataean, +New_Tai_Lue, +Nko, +Ogham, +Ol_Chiki, +Old_Italic, +Old_North_Arabian, +Old_Permic, +Old_Persian, +Old_South_Arabian, +Old_Turkic, +Oriya, +Osmanya, +Pahawh_Hmong, +Palmyrene, +Pau_Cin_Hau, +Phags_Pa, +Phoenician, +Psalter_Pahlavi, +Rejang, +Runic, +Samaritan, +Saurashtra, +Sharada, +Shavian, +Siddham, +Sinhala, +Sora_Sompeng, +Sundanese, +Syloti_Nagri, +Syriac, +Tagalog, +Tagbanwa, +Tai_Le, +Tai_Tham, +Tai_Viet, +Takri, +Tamil, +Telugu, +Thaana, +Thai, +Tibetan, +Tifinagh, +Tirhuta, +Ugaritic, +Vai, +Warang_Citi, +Yi. +
++
+ [...] positive character class + [^...] negative character class + [x-y] range (can be used for hex characters) + [[:xxx:]] positive POSIX named set + [[:^xxx:]] negative POSIX named set + + alnum alphanumeric + alpha alphabetic + ascii 0-127 + blank space or tab + cntrl control character + digit decimal digit + graph printing, excluding space + lower lower case letter + print printing, including space + punct printing, excluding alphanumeric + space white space + upper upper case letter + word same as \w + xdigit hexadecimal digit ++In PCRE2, POSIX character set names recognize only ASCII characters by default, +but some of them use Unicode properties if PCRE2_UCP is set. You can use +\Q...\E inside a character class. + +
+
+ ? 0 or 1, greedy + ?+ 0 or 1, possessive + ?? 0 or 1, lazy + * 0 or more, greedy + *+ 0 or more, possessive + *? 0 or more, lazy + + 1 or more, greedy + ++ 1 or more, possessive + +? 1 or more, lazy + {n} exactly n + {n,m} at least n, no more than m, greedy + {n,m}+ at least n, no more than m, possessive + {n,m}? at least n, no more than m, lazy + {n,} n or more, greedy + {n,}+ n or more, possessive + {n,}? n or more, lazy ++ +
+
+ \b word boundary + \B not a word boundary + ^ start of subject + also after internal newline in multiline mode + \A start of subject + $ end of subject + also before newline at end of subject + also before internal newline in multiline mode + \Z end of subject + also before newline at end of subject + \z end of subject + \G first matching position in subject ++ +
+
+ \K reset start of match ++\K is honoured in positive assertions, but ignored in negative ones. + +
+
+ expr|expr|expr... ++ +
+
+ (...) capturing group + (?<name>...) named capturing group (Perl) + (?'name'...) named capturing group (Perl) + (?P<name>...) named capturing group (Python) + (?:...) non-capturing group + (?|...) non-capturing group; reset group numbers for + capturing groups in each alternative ++ +
+
+ (?>...) atomic, non-capturing group ++ +
+
+ (?#....) comment (not nestable) ++ +
+
+ (?i) caseless + (?J) allow duplicate names + (?m) multiline + (?s) single line (dotall) + (?U) default ungreedy (lazy) + (?x) extended (ignore white space) + (?-...) unset option(s) ++The following are recognized only at the very start of a pattern or after one +of the newline or \R options with similar syntax. More than one of them may +appear. +
+ (*LIMIT_MATCH=d) set the match limit to d (decimal number) + (*LIMIT_RECURSION=d) set the recursion limit to d (decimal number) + (*NOTEMPTY) set PCRE2_NOTEMPTY when matching + (*NOTEMPTY_ATSTART) set PCRE2_NOTEMPTY_ATSTART when matching + (*NO_AUTO_POSSESS) no auto-possessification (PCRE2_NO_AUTO_POSSESS) + (*NO_START_OPT) no start-match optimization (PCRE2_NO_START_OPTIMIZE) + (*UTF) set appropriate UTF mode for the library in use + (*UCP) set PCRE2_UCP (use Unicode properties for \d etc) ++Note that LIMIT_MATCH and LIMIT_RECURSION can only reduce the value of the +limits set by the caller of pcre2_match(), not increase them. + +
+These are recognized only at the very start of the pattern or after option +settings with a similar syntax. +
+ (*CR) carriage return only + (*LF) linefeed only + (*CRLF) carriage return followed by linefeed + (*ANYCRLF) all three of the above + (*ANY) any Unicode newline sequence ++ +
+These are recognized only at the very start of the pattern or after option +settings with a similar syntax. +
+ (*BSR_ANYCRLF) CR, LF, or CRLF + (*BSR_UNICODE) any Unicode newline sequence ++ +
+
+ (?=...) positive look ahead + (?!...) negative look ahead + (?<=...) positive look behind + (?<!...) negative look behind ++Each top-level branch of a look behind must be of a fixed length. + +
+
+ \n reference by number (can be ambiguous) + \gn reference by number + \g{n} reference by number + \g{-n} relative reference by number + \k<name> reference by name (Perl) + \k'name' reference by name (Perl) + \g{name} reference by name (Perl) + \k{name} reference by name (.NET) + (?P=name) reference by name (Python) ++ +
+
+ (?R) recurse whole pattern + (?n) call subpattern by absolute number + (?+n) call subpattern by relative number + (?-n) call subpattern by relative number + (?&name) call subpattern by name (Perl) + (?P>name) call subpattern by name (Python) + \g<name> call subpattern by name (Oniguruma) + \g'name' call subpattern by name (Oniguruma) + \g<n> call subpattern by absolute number (Oniguruma) + \g'n' call subpattern by absolute number (Oniguruma) + \g<+n> call subpattern by relative number (PCRE2 extension) + \g'+n' call subpattern by relative number (PCRE2 extension) + \g<-n> call subpattern by relative number (PCRE2 extension) + \g'-n' call subpattern by relative number (PCRE2 extension) ++ +
+
+ (?(condition)yes-pattern) + (?(condition)yes-pattern|no-pattern) + + (?(n)... absolute reference condition + (?(+n)... relative reference condition + (?(-n)... relative reference condition + (?(<name>)... named reference condition (Perl) + (?('name')... named reference condition (Perl) + (?(name)... named reference condition (PCRE2) + (?(R)... overall recursion condition + (?(Rn)... specific group recursion condition + (?(R&name)... specific recursion condition + (?(DEFINE)... define subpattern for reference + (?(assert)... assertion condition ++ +
+The following act immediately they are reached: +
+ (*ACCEPT) force successful match + (*FAIL) force backtrack; synonym (*F) + (*MARK:NAME) set name to be passed back; synonym (*:NAME) ++The following act only when a subsequent match failure causes a backtrack to +reach them. They all force a match failure, but they differ in what happens +afterwards. Those that advance the start-of-match point do so only if the +pattern is not anchored. +
+ (*COMMIT) overall failure, no advance of starting point + (*PRUNE) advance to next starting character + (*PRUNE:NAME) equivalent to (*MARK:NAME)(*PRUNE) + (*SKIP) advance to current matching position + (*SKIP:NAME) advance to position corresponding to an earlier + (*MARK:NAME); if not found, the (*SKIP) is ignored + (*THEN) local failure, backtrack to next alternation + (*THEN:NAME) equivalent to (*MARK:NAME)(*THEN) ++ +
+
+ (?C) callout + (?Cn) callout with data n ++ +
+pcre2pattern(3), pcre2api(3), pcre2callout(3), +pcre2matching(3), pcre2(3). +
+
+Philip Hazel
+
+University Computing Service
+
+Cambridge CB2 3QH, England.
+
+
+Last updated: 20 October 2014
+
+Copyright © 1997-2014 University of Cambridge.
+
+
+Return to the PCRE2 index page. +
diff --git a/doc/pcre2perform.3 b/doc/pcre2perform.3 new file mode 100644 index 0000000..d01b2ed --- /dev/null +++ b/doc/pcre2perform.3 @@ -0,0 +1,178 @@ +.TH PCRE2PERFORM 3 "20 October 2014" "PCRE2 10.00" +.SH NAME +PCRE2 - Perl-compatible regular expressions (revised API) +.SH "PCRE2 PERFORMANCE" +.rs +.sp +Two aspects of performance are discussed below: memory usage and processing +time. The way you express your pattern as a regular expression can affect both +of them. +. +.SH "COMPILED PATTERN MEMORY USAGE" +.rs +.sp +Patterns are compiled by PCRE2 into a reasonably efficient interpretive code, +so that most simple patterns do not use much memory. However, there is one case +where the memory usage of a compiled pattern can be unexpectedly large. If a +parenthesized subpattern has a quantifier with a minimum greater than 1 and/or +a limited maximum, the whole subpattern is repeated in the compiled code. For +example, the pattern +.sp + (abc|def){2,4} +.sp +is compiled as if it were +.sp + (abc|def)(abc|def)((abc|def)(abc|def)?)? +.sp +(Technical aside: It is done this way so that backtrack points within each of +the repetitions can be independently maintained.) +.P +For regular expressions whose quantifiers use only small numbers, this is not +usually a problem. However, if the numbers are large, and particularly if such +repetitions are nested, the memory usage can become an embarrassment. For +example, the very simple pattern +.sp + ((ab){1,1000}c){1,3} +.sp +uses 51K bytes when compiled using the 8-bit library. When PCRE2 is compiled +with its default internal pointer size of two bytes, the size limit on a +compiled pattern is 64K code units in the 8-bit and 16-bit libraries, and this +is reached with the above pattern if the outer repetition is increased from 3 +to 4. PCRE2 can be compiled to use larger internal pointers and thus handle +larger compiled patterns, but it is better to try to rewrite your pattern to +use less memory if you can.
+.P +One way of reducing the memory usage for such patterns is to make use of +PCRE2's +.\" HTML +.\" +"subroutine" +.\" +facility. Re-writing the above pattern as +.sp + ((ab)(?2){0,999}c)(?1){0,2} +.sp +reduces the memory requirements to 18K, and indeed it remains under 20K even +with the outer repetition increased to 100. However, this pattern is not +exactly equivalent, because the "subroutine" calls are treated as +.\" HTML +.\" +atomic groups +.\" +into which there can be no backtracking if there is a subsequent matching +failure. Therefore, PCRE2 cannot do this kind of rewriting automatically. +Furthermore, there is a noticeable loss of speed when executing the modified +pattern. Nevertheless, if the atomic grouping is not a problem and the loss of +speed is acceptable, this kind of rewriting will allow you to process patterns +that PCRE2 cannot otherwise handle. +. +. +.SH "STACK USAGE AT RUN TIME" +.rs +.sp +When \fBpcre2_match()\fP is used for matching, certain kinds of pattern can +cause it to use large amounts of the process stack. In some environments the +default process stack is quite small, and if it runs out the result is often +SIGSEGV. Rewriting your pattern can often help. The +.\" HREF +\fBpcre2stack\fP +.\" +documentation discusses this issue in detail. +. +. +.SH "PROCESSING TIME" +.rs +.sp +Certain items in regular expression patterns are processed more efficiently +than others. It is more efficient to use a character class like [aeiou] than a +set of single-character alternatives such as (a|e|i|o|u). In general, the +simplest construction that provides the required behaviour is usually the most +efficient. Jeffrey Friedl's book contains a lot of useful general discussion +about optimizing regular expressions for efficient performance. This document +contains a few observations about PCRE2. 
+.P +Using Unicode character properties (the \ep, \eP, and \eX escapes) is slow, +because PCRE2 has to use a multi-stage table lookup whenever it needs a +character's property. If you can find an alternative pattern that does not use +character properties, it will probably be faster. +.P +By default, the escape sequences \eb, \ed, \es, and \ew, and the POSIX +character classes such as [:alpha:] do not use Unicode properties, partly for +backwards compatibility, and partly for performance reasons. However, you can +set the PCRE2_UCP option or start the pattern with (*UCP) if you want Unicode +character properties to be used. This can double the matching time for items +such as \ed, when matched with \fBpcre2_match()\fP; the performance loss is +less with a DFA matching function, and in both cases there is not much +difference for \eb. +.P +When a pattern begins with .* not in parentheses, or in parentheses that are +not the subject of a backreference, and the PCRE2_DOTALL option is set, the +pattern is implicitly anchored by PCRE2, since it can match only at the start +of a subject string. However, if PCRE2_DOTALL is not set, PCRE2 cannot make +this optimization, because the dot metacharacter does not then match a newline, +and if the subject string contains newlines, the pattern may match from the +character immediately following one of them instead of from the very start. For +example, the pattern +.sp + .*second +.sp +matches the subject "first\enand second" (where \en stands for a newline +character), with the match starting at the seventh character. In order to do +this, PCRE2 has to retry the match starting after every newline in the subject. +.P +If you are using such a pattern with subject strings that do not contain +newlines, the best performance is obtained by setting PCRE2_DOTALL, or starting +the pattern with ^.* or ^.*? to indicate explicit anchoring. That saves PCRE2 +from having to scan along the subject looking for a newline to restart at. 
+.P +Beware of patterns that contain nested indefinite repeats. These can take a +long time to run when applied to a string that does not match. Consider the +pattern fragment +.sp + ^(a+)* +.sp +This can match "aaaa" in 16 different ways, and this number increases very +rapidly as the string gets longer. (The * repeat can match 0, 1, 2, 3, or 4 +times, and for each of those cases other than 0 or 4, the + repeats can match +different numbers of times.) When the remainder of the pattern is such that the +entire match is going to fail, PCRE2 has in principle to try every possible +variation, and this can take an extremely long time, even for relatively short +strings. +.P +An optimization catches some of the simpler cases such as +.sp + (a+)*b +.sp +where a literal character follows. Before embarking on the standard matching +procedure, PCRE2 checks that there is a "b" later in the subject string, and if +there is not, it fails the match immediately. However, when there is no +following literal this optimization cannot be used. You can see the difference +by comparing the behaviour of +.sp + (a+)*\ed +.sp +with the pattern above. The pattern above, (a+)*b, fails almost instantly when +applied to a whole line of "a" characters, whereas (a+)*\ed takes an +appreciable time with strings longer than about 20 characters. +.P +In many cases, the solution to this kind of performance issue is to use an +atomic group or a possessive quantifier. +. +. +.SH AUTHOR +.rs +.sp +.nf +Philip Hazel +University Computing Service +Cambridge CB2 3QH, England. +.fi +. +. +.SH REVISION +.rs +.sp +.nf +Last updated: 20 October 2014 +Copyright (c) 1997-2014 University of Cambridge. +.fi diff --git a/doc/pcre2posix.3 b/doc/pcre2posix.3 new file mode 100644 index 0000000..5d4a721 --- /dev/null +++ b/doc/pcre2posix.3 @@ -0,0 +1,268 @@ +.TH PCRE2POSIX 3 "20 October 2014" "PCRE2 10.00" +.SH NAME +PCRE2 - Perl-compatible regular expressions (revised API) +.SH "SYNOPSIS" +.rs +.sp +.B #include