Documentation update.
This commit is contained in:
parent
ca75d430dd
commit
6535e4a468
53
HACKING
53
HACKING
@ -88,10 +88,10 @@ I had a flash of inspiration as to how I could run the real compile function in
|
||||
a "fake" mode that enables it to compute how much memory it would need, while
|
||||
in most cases only ever using a small amount of working memory, and without too
|
||||
many tests of the mode that might slow it down. So I refactored the compiling
|
||||
functions to work this way. This got rid of about 600 lines of source. It
|
||||
should make future maintenance and development easier. As this was such a major
|
||||
change, I never released 6.8, instead upping the number to 7.0 (other quite
|
||||
major changes were also present in the 7.0 release).
|
||||
functions to work this way. This got rid of about 600 lines of source and made
|
||||
further maintenance and development easier. As this was such a major change, I
|
||||
never released 6.8, instead upping the number to 7.0 (other quite major changes
|
||||
were also present in the 7.0 release).
|
||||
|
||||
A side effect of this work was that the previous limit of 200 on the nesting
|
||||
depth of parentheses was removed. However, there was a downside: compiling ran
|
||||
@ -122,7 +122,7 @@ all the named subpatterns and their corresponding group numbers. This means
|
||||
that the actual compile (both the memory-computing dummy run and the real
|
||||
compile) has full knowledge of group names and numbers throughout. Several
|
||||
dozen lines of messy code were eliminated, though the new pre-pass was not
|
||||
short. In particular, parsing and skipping over [] classes is complicated.
|
||||
short. In particular, parsing and skipping over [] classes was complicated.
|
||||
|
||||
While working on 10.22 I realized that I could simplify yet again by moving
|
||||
more of the parsing into the pre-pass, thus avoiding doing it in two places, so
|
||||
@ -162,7 +162,7 @@ simpler than before.
|
||||
Most errors can be diagnosed during the parsing scan. For those that cannot
|
||||
(for example, "lookbehind assertion is not fixed length"), the parsed code
|
||||
contains offsets into the pattern so that the actual compiling code can
|
||||
identify where errors occur.
|
||||
report where errors are.
|
||||
|
||||
|
||||
The elements of the parsed pattern vector
|
||||
@ -217,10 +217,10 @@ The following have data in the lower 16 bits, and may be followed by other data
|
||||
elements:
|
||||
|
||||
META_ALT | alternation
|
||||
META_BACKREF
|
||||
META_CAPTURE
|
||||
META_ESCAPE
|
||||
META_RECURSE
|
||||
META_BACKREF back reference
|
||||
META_CAPTURE start of capturing group
|
||||
META_ESCAPE non-literal escape sequence
|
||||
META_RECURSE recursion call
|
||||
|
||||
If the data for META_ALT is non-zero, it is inside a lookbehind, and the data
|
||||
is the length of its branch, for which OP_REVERSE must be generated.
|
||||
@ -232,8 +232,8 @@ META_BACKREF is followed by an offset if the back reference group number is 10
|
||||
or more. The offsets of the first ocurrences of references to groups whose
|
||||
numbers are less than 10 are put in cb->small_ref_offset[] (only the first
|
||||
occurrence is useful). On 64-bit systems this avoids using more than two parsed
|
||||
pattern elements for items such as \3. The offset is used when an error is
|
||||
given for a reference to a non-existent group.
|
||||
pattern elements for items such as \3. The offset is used when an error occurs
|
||||
because the reference is to a non-existent group.
|
||||
|
||||
META_RECURSE is always followed by an offset, for use in error messages.
|
||||
|
||||
@ -286,7 +286,7 @@ group; this is used when generating OP_REVERSE for that branch.
|
||||
META_LOOKBEHIND (?<=
|
||||
META_LOOKBEHINDNOT (?<!
|
||||
|
||||
The following are followed by two values, the minimum and maximum. Repeat
|
||||
The following are followed by two elements, the minimum and maximum. Repeat
|
||||
values are limited to 65535 (MAX_REPEAT). A maximum value of "unlimited" is
|
||||
represented by UNLIMITED_REPEAT, which is bigger than MAX_REPEAT:
|
||||
|
||||
@ -369,7 +369,7 @@ unit, the most significant unit is first.
|
||||
|
||||
In this description, we assume the "normal" compilation options. Data values
|
||||
that are counts (e.g. quantifiers) are always two bytes long in 8-bit mode
|
||||
(most significant byte first), or one code unit in 16-bit and 32-bit modes.
|
||||
(most significant byte first), and one code unit in 16-bit and 32-bit modes.
|
||||
|
||||
|
||||
Opcodes with no following data
|
||||
@ -409,7 +409,7 @@ These items are all just one unit long
|
||||
OP_ACCEPT ) These are Perl 5.10's "backtracking control
|
||||
OP_COMMIT ) verbs". If OP_ACCEPT is inside capturing
|
||||
OP_FAIL ) parentheses, it may be preceded by one or more
|
||||
OP_PRUNE ) OP_CLOSE, each followed by a count that
|
||||
OP_PRUNE ) OP_CLOSE, each followed by a number that
|
||||
OP_SKIP ) indicates which parentheses must be closed.
|
||||
OP_THEN )
|
||||
|
||||
@ -679,7 +679,7 @@ Once-only (atomic) groups
|
||||
|
||||
These are just like other subpatterns, but they start with the opcode OP_ONCE.
|
||||
The check for matching an empty string in an unbounded repeat is handled
|
||||
entirely at runtime, so there are just this one opcode for atomic groups.
|
||||
entirely at runtime, so there is just this one opcode for atomic groups.
|
||||
|
||||
|
||||
Assertions
|
||||
@ -742,14 +742,21 @@ Recursion
|
||||
Recursion either matches the current pattern, or some subexpression. The opcode
|
||||
OP_RECURSE is followed by a LINK_SIZE value that is the offset to the starting
|
||||
bracket from the start of the whole pattern. OP_RECURSE is also used for
|
||||
"subroutine" calls, even though they are not strictly a recursion. Repeated
|
||||
recursions are automatically wrapped inside OP_ONCE brackets, because otherwise
|
||||
some patterns broke them. A non-repeated recursion is not wrapped in OP_ONCE
|
||||
brackets, but it is nevertheless still treated as an atomic group.
|
||||
"subroutine" calls, even though they are not strictly a recursion. Up till
|
||||
release 10.30 recursions were treated as atomic groups, making them
|
||||
incompatible with Perl (but PCRE had then well before Perl did). From 10.30,
|
||||
backtracking into recursions is supported.
|
||||
|
||||
Repeated recursions used to be wrapped inside OP_ONCE brackets, which not only
|
||||
forced no backtracking, but also allowed repetition to be handled as for other
|
||||
bracketed groups. From 10.30 onwards, repeated recursions are duplicated for
|
||||
their minimum repetitions, and then wrapped in non-capturing brackets for the
|
||||
remainder. For example, (?1){3} is treated as (?1)(?1)(?1), and (?1){2,4} is
|
||||
treated as (?1)(?1)(?:(?1)){0,2}.
|
||||
|
||||
|
||||
Callout
|
||||
-------
|
||||
Callouts
|
||||
--------
|
||||
|
||||
A callout can nowadays have either a numerical argument or a string argument.
|
||||
These use OP_CALLOUT or OP_CALLOUT_STR, respectively. In each case these are
|
||||
@ -787,4 +794,4 @@ not a real opcode, but is used to check that tables indexed by opcode are the
|
||||
correct length, in order to catch updating errors.
|
||||
|
||||
Philip Hazel
|
||||
March 2017
|
||||
17 March 2017
|
||||
|
@ -1,10 +1,6 @@
|
||||
Building PCRE2 without using autotools
|
||||
--------------------------------------
|
||||
|
||||
This document has been converted from the PCRE1 document. I have removed a
|
||||
number of sections about building in various environments, as they applied only
|
||||
to PCRE1 and are probably out of date.
|
||||
|
||||
This document contains the following sections:
|
||||
|
||||
General
|
||||
@ -183,21 +179,9 @@ can skip ahead to the CMake section.
|
||||
|
||||
STACK SIZE IN WINDOWS ENVIRONMENTS
|
||||
|
||||
The default processor stack size of 1Mb in some Windows environments is too
|
||||
small for matching patterns that need much recursion. In particular, test 2 may
|
||||
fail because of this. Normally, running out of stack causes a crash, but there
|
||||
have been cases where the test program has just died silently. See your linker
|
||||
documentation for how to increase stack size if you experience problems. If you
|
||||
are using CMake (see "BUILDING PCRE2 ON WINDOWS WITH CMAKE" below) and the gcc
|
||||
compiler, you can increase the stack size for pcre2test and pcre2grep by
|
||||
setting the CMAKE_EXE_LINKER_FLAGS variable to "-Wl,--stack,8388608" (for
|
||||
example). The Linux default of 8Mb is a reasonable choice for the stack, though
|
||||
even that can be too small for some pattern/subject combinations.
|
||||
|
||||
PCRE2 has a compile configuration option to disable the use of stack for
|
||||
recursion so that heap is used instead. However, pattern matching is
|
||||
significantly slower when this is done. There is more about stack usage in the
|
||||
"pcre2stack" documentation.
|
||||
Prior to release 10.30 the default system stack size of 1Mb in some Windows
|
||||
environments caused issues with some tests. This should no longer be the case
|
||||
for 10.30 and later releases.
|
||||
|
||||
|
||||
LINKING PROGRAMS IN WINDOWS ENVIRONMENTS
|
||||
@ -393,4 +377,4 @@ and executable, is in EBCDIC and native z/OS file formats and this is the
|
||||
recommended download site.
|
||||
|
||||
=============================
|
||||
Last Updated: 13 October 2016
|
||||
Last Updated: 17 March 2017
|
||||
|
91
README
91
README
@ -15,8 +15,8 @@ subscribe or manage your subscription here:
|
||||
|
||||
https://lists.exim.org/mailman/listinfo/pcre-dev
|
||||
|
||||
Please read the NEWS file if you are upgrading from a previous release.
|
||||
The contents of this README file are:
|
||||
Please read the NEWS file if you are upgrading from a previous release. The
|
||||
contents of this README file are:
|
||||
|
||||
The PCRE2 APIs
|
||||
Documentation for PCRE2
|
||||
@ -44,8 +44,8 @@ wrappers.
|
||||
|
||||
The distribution does contain a set of C wrapper functions for the 8-bit
|
||||
library that are based on the POSIX regular expression API (see the pcre2posix
|
||||
man page). These can be found in a library called libpcre2-posix. Note that this
|
||||
just provides a POSIX calling interface to PCRE2; the regular expressions
|
||||
man page). These can be found in a library called libpcre2-posix. Note that
|
||||
this just provides a POSIX calling interface to PCRE2; the regular expressions
|
||||
themselves still follow Perl syntax and semantics. The POSIX API is restricted,
|
||||
and does not give full access to all of PCRE2's facilities.
|
||||
|
||||
@ -95,10 +95,9 @@ PCRE2 documentation is supplied in two other forms:
|
||||
Building PCRE2 on non-Unix-like systems
|
||||
---------------------------------------
|
||||
|
||||
For a non-Unix-like system, please read the comments in the file
|
||||
NON-AUTOTOOLS-BUILD, though if your system supports the use of "configure" and
|
||||
"make" you may be able to build PCRE2 using autotools in the same way as for
|
||||
many Unix-like systems.
|
||||
For a non-Unix-like system, please read the file NON-AUTOTOOLS-BUILD, though if
|
||||
your system supports the use of "configure" and "make" you may be able to build
|
||||
PCRE2 using autotools in the same way as for many Unix-like systems.
|
||||
|
||||
PCRE2 can also be configured using CMake, which can be run in various ways
|
||||
(command line, GUI, etc). This creates Makefiles, solution files, etc. The file
|
||||
@ -174,19 +173,19 @@ library. They are also documented in the pcre2build man page.
|
||||
architectures. If you try to enable it on an unsupported architecture, there
|
||||
will be a compile time error.
|
||||
|
||||
. If you do not want to make use of the support for UTF-8 Unicode character
|
||||
strings in the 8-bit library, UTF-16 Unicode character strings in the 16-bit
|
||||
library, or UTF-32 Unicode character strings in the 32-bit library, you can
|
||||
add --disable-unicode to the "configure" command. This reduces the size of
|
||||
the libraries. It is not possible to configure one library with Unicode
|
||||
support, and another without, in the same configuration.
|
||||
. If you do not want to make use of the default support for UTF-8 Unicode
|
||||
character strings in the 8-bit library, UTF-16 Unicode character strings in
|
||||
the 16-bit library, or UTF-32 Unicode character strings in the 32-bit
|
||||
library, you can add --disable-unicode to the "configure" command. This
|
||||
reduces the size of the libraries. It is not possible to configure one
|
||||
library with Unicode support, and another without, in the same configuration.
|
||||
It is also not possible to use --enable-ebcdic (see below) with Unicode
|
||||
support, so if this option is set, you must also use --disable-unicode.
|
||||
|
||||
When Unicode support is available, the use of a UTF encoding still has to be
|
||||
enabled by setting the PCRE2_UTF option at run time or starting a pattern
|
||||
with (*UTF). When PCRE2 is compiled with Unicode support, its input can only
|
||||
either be ASCII or UTF-8/16/32, even when running on EBCDIC platforms. It is
|
||||
not possible to use both --enable-unicode and --enable-ebcdic at the same
|
||||
time.
|
||||
either be ASCII or UTF-8/16/32, even when running on EBCDIC platforms.
|
||||
|
||||
As well as supporting UTF strings, Unicode support includes support for the
|
||||
\P, \p, and \X sequences that recognize Unicode character properties.
|
||||
@ -232,18 +231,18 @@ library. They are also documented in the pcre2build man page.
|
||||
--with-match-limit=500000
|
||||
|
||||
on the "configure" command. This is just the default; individual calls to
|
||||
pcre2_match() can supply their own value. There is more discussion on the
|
||||
pcre2api man page.
|
||||
pcre2_match() can supply their own value. There is more discussion in the
|
||||
pcre2api man page (search for pcre2_set_match_limit).
|
||||
|
||||
. There is a separate counter that limits the depth of recursive function calls
|
||||
during a matching process. This also has a default of ten million, which is
|
||||
essentially "unlimited". You can change the default by setting, for example,
|
||||
. There is a separate counter that limits the depth of nested backtracking
|
||||
during a matching process, which in turn limits the amount of memory that is
|
||||
used. This also has a default of ten million, which is essentially
|
||||
"unlimited". You can change the default by setting, for example,
|
||||
|
||||
--with-match-limit-recursion=500000
|
||||
--with-match-limit-depth=5000
|
||||
|
||||
Recursive function calls use up the runtime stack; running out of stack can
|
||||
cause programs to crash in strange ways. There is a discussion about stack
|
||||
sizes in the pcre2stack man page.
|
||||
There is more discussion in the pcre2api man page (search for
|
||||
pcre2_set_depth_limit).
|
||||
|
||||
. In the 8-bit library, the default maximum compiled pattern size is around
|
||||
64K bytes. You can increase this by adding --with-link-size=3 to the
|
||||
@ -254,20 +253,6 @@ library. They are also documented in the pcre2build man page.
|
||||
performance in the 8-bit and 16-bit libraries. In the 32-bit library, the
|
||||
link size setting is ignored, as 4-byte offsets are always used.
|
||||
|
||||
. You can build PCRE2 so that its internal match() function that is called from
|
||||
pcre2_match() does not call itself recursively. Instead, it uses memory
|
||||
blocks obtained from the heap to save data that would otherwise be saved on
|
||||
the stack. To build PCRE2 like this, use
|
||||
|
||||
--disable-stack-for-recursion
|
||||
|
||||
on the "configure" command. PCRE2 runs more slowly in this mode, but it may
|
||||
be necessary in environments with limited stack sizes. This applies only to
|
||||
the normal execution of the pcre2_match() function; if JIT support is being
|
||||
successfully used, it is not relevant. Equally, it does not apply to
|
||||
pcre2_dfa_match(), which does not use deeply nested recursion. There is a
|
||||
discussion about stack sizes in the pcre2stack man page.
|
||||
|
||||
. For speed, PCRE2 uses four tables for manipulating and identifying characters
|
||||
whose code point values are less than 256. By default, it uses a set of
|
||||
tables for ASCII encoding that is part of the distribution. If you specify
|
||||
@ -389,6 +374,13 @@ library. They are also documented in the pcre2build man page.
|
||||
string. Otherwise, it is assumed to be a file name, and the contents of the
|
||||
file are the test string.
|
||||
|
||||
. Releases before 10.30 could be compiled with --disable-stack-for-recursion,
|
||||
which caused pcre2_match() to use individual blocks on the heap for
|
||||
backtracking instead of recursive function calls (which use the stack). This
|
||||
is now obsolete since pcre2_match() was refactored always to use the heap (in
|
||||
a much more efficient way than before). This option is retained for backwards
|
||||
compatibility, but has no effect other than to output a warning.
|
||||
|
||||
The "configure" script builds the following files for the basic C library:
|
||||
|
||||
. Makefile the makefile that builds the library
|
||||
@ -662,25 +654,32 @@ Unicode support is enabled.
|
||||
Tests 9 and 10 are run only in 8-bit mode, and tests 11 and 12 are run only in
|
||||
16-bit and 32-bit modes. These are tests that generate different output in
|
||||
8-bit mode. Each pair are for general cases and Unicode support, respectively.
|
||||
|
||||
Test 13 checks the handling of non-UTF characters greater than 255 by
|
||||
pcre2_dfa_match() in 16-bit and 32-bit modes.
|
||||
|
||||
Test 14 contains a number of tests that must not be run with JIT. They check,
|
||||
Test 14 contains some special UTF and UCP tests that give different output for
|
||||
the different widths.
|
||||
|
||||
Test 15 contains a number of tests that must not be run with JIT. They check,
|
||||
among other non-JIT things, the match-limiting features of the intepretive
|
||||
matcher.
|
||||
|
||||
Test 15 is run only when JIT support is not available. It checks that an
|
||||
Test 16 is run only when JIT support is not available. It checks that an
|
||||
attempt to use JIT has the expected behaviour.
|
||||
|
||||
Test 16 is run only when JIT support is available. It checks JIT complete and
|
||||
Test 17 is run only when JIT support is available. It checks JIT complete and
|
||||
partial modes, match-limiting under JIT, and other JIT-specific features.
|
||||
|
||||
Tests 17 and 18 are run only in 8-bit mode. They check the POSIX interface to
|
||||
Tests 18 and 19 are run only in 8-bit mode. They check the POSIX interface to
|
||||
the 8-bit library, without and with Unicode support, respectively.
|
||||
|
||||
Test 19 checks the serialization functions by writing a set of compiled
|
||||
Test 20 checks the serialization functions by writing a set of compiled
|
||||
patterns to a file, and then reloading and checking them.
|
||||
|
||||
Tests 21 and 22 test \C support when the use of \C is not locked out, without
|
||||
and with UTF support, respectively. Test 23 tests \C when it is locked out.
|
||||
|
||||
|
||||
Character tables
|
||||
----------------
|
||||
@ -866,4 +865,4 @@ The distribution should contain the files listed below.
|
||||
Philip Hazel
|
||||
Email local part: ph10
|
||||
Email domain: cam.ac.uk
|
||||
Last updated: 01 November 2016
|
||||
Last updated: 17 March 2017
|
||||
|
Loading…
Reference in New Issue
Block a user