Documentation update for Script Extensions property coding.

This commit is contained in:
ph10 2018-10-07 16:29:51 +00:00
parent 4cdcd57030
commit a0c94a313c
2 changed files with 158 additions and 146 deletions

View File

@ -8,26 +8,28 @@
# the upgrading of Unicode property support. The new code speeds up property
# matching many times. The script is for the use of PCRE maintainers, to
# generate the pcre2_ucd.c file that contains a digested form of the Unicode
# data tables.
# data tables. A number of extensions have been added to the original script.
#
# The script has now been upgraded to Python 3 for PCRE2, and should be run in
# the maint subdirectory, using the command
#
# [python3] ./MultiStage2.py >../src/pcre2_ucd.c
#
# It requires five Unicode data tables: DerivedGeneralCategory.txt,
# GraphemeBreakProperty.txt, Scripts.txt, CaseFolding.txt, and emoji-data.txt.
# These must be in the maint/Unicode.tables subdirectory.
# It requires six Unicode data tables: DerivedGeneralCategory.txt,
# GraphemeBreakProperty.txt, Scripts.txt, ScriptExtensions.txt,
# CaseFolding.txt, and emoji-data.txt. These must be in the
# maint/Unicode.tables subdirectory.
#
# DerivedGeneralCategory.txt is found in the "extracted" subdirectory of the
# Unicode database (UCD) on the Unicode web site; GraphemeBreakProperty.txt is
# in the "auxiliary" subdirectory. Scripts.txt and CaseFolding.txt are directly
# in the UCD directory. The emoji-data.txt file is in files associated with
# Unicode Technical Standard #51 ("Unicode Emoji"), for example:
#
# http://unicode.org/Public/emoji/11.0/emoji-data.txt
# in the "auxiliary" subdirectory. Scripts.txt, ScriptExtensions.txt, and
# CaseFolding.txt are directly in the UCD directory. The emoji-data.txt file is
# in files associated with Unicode Technical Standard #51 ("Unicode Emoji"),
# for example:
#
# http://unicode.org/Public/emoji/11.0/emoji-data.txt
#
# -----------------------------------------------------------------------------
# Minor modifications made to this script:
# Added #! line at start
# Removed tabs
@ -61,78 +63,8 @@
# property, which is used by PCRE2 as a grapheme breaking property. This was
# done when updating to Unicode 11.0.0 (July 2018).
#
# Added code to add a Script Extensions field to records.
#
#
# The main tables generated by this script are used by macros defined in
# pcre2_internal.h. They look up Unicode character properties using short
# sequences of code that contains no branches, which makes for greater speed.
#
# Conceptually, there is a table of records (of type ucd_record), containing a
# script number, script extension value, character type, grapheme break type,
# offset to caseless matching set, offset to the character's other case, for
# every character. However, a real table covering all Unicode characters would
# be far too big. It can be efficiently compressed by observing that many
# characters have the same record, and many blocks of characters (taking 128
# characters in a block) have the same set of records as other blocks. This
# leads to a 2-stage lookup process.
#
# This script constructs six tables. The ucd_caseless_sets table contains
# lists of characters that all match each other caselessly. Each list is
# in order, and is terminated by NOTACHAR (0xffffffff), which is larger than
# any valid character. The first list is empty; this is used for characters
# that are not part of any list.
#
# The ucd_digit_sets table contains the code points of the '9' characters in
# each set of 10 decimal digits in Unicode. This is used to ensure that digits
# in script runs all come from the same set. The first element in the vector
# contains the number of subsequent elements, which are in ascending order.
#
# The ucd_script_sets vector contains lists of script numbers that are the
# Script Extensions properties of certain characters. Each list is terminated
# by zero (ucp_Unknown). A character with more than one script listed for its
# Script Extension property has a negative value in its record. This is the
# negated offset to the start of the relevant list.
#
# The ucd_records table contains one instance of every unique record that is
# required. The ucd_stage1 table is indexed by a character's block number, and
# yields what is in effect a "virtual" block number. The ucd_stage2 table is a
# table of "virtual" blocks; each block is indexed by the offset of a character
# within its own block, and the result is the offset of the required record.
#
# The following examples are correct for the Unicode 11.0.0 database. Future
# updates may change the actual lookup values.
#
# Example: lowercase "a" (U+0061) is in block 0
# lookup 0 in stage1 table yields 0
# lookup 97 in the first table in stage2 yields 16
# record 17 is { 33, 5, 11, 0, -32 }
# 33 = ucp_Latin => Latin script
# 5 = ucp_Ll => Lower case letter
# 12 = ucp_gbOther => Grapheme break property "Other"
# 0 => not part of a caseless set
# -32 => Other case is U+0041
#
# Almost all lowercase Latin characters resolve to the same record. One or two
# are different because they are part of a multi-character caseless set (for
# example, k, K and the Kelvin symbol are such a set).
#
# Example: hiragana letter A (U+3042) is in block 96 (0x60)
# lookup 96 in stage1 table yields 90
# lookup 66 in the 90th table in stage2 yields 515
# record 515 is { 26, 7, 11, 0, 0 }
# 26 = ucp_Hiragana => Hiragana script
# 7 = ucp_Lo => Other letter
# 12 = ucp_gbOther => Grapheme break property "Other"
# 0 => not part of a caseless set
# 0 => No other case
#
# In these examples, no other blocks resolve to the same "virtual" block, as it
# happens, but plenty of other blocks do share "virtual" blocks.
#
# Philip Hazel, 03 July 2008
# Last Updated: 03 October 2018
#
# Added code to add a Script Extensions field to records. This has increased
# their size from 8 to 12 bytes, only 10 of which are currently used.
#
# 01-March-2010: Updated list of scripts for Unicode 5.2.0
# 30-April-2011: Updated list of scripts for Unicode 6.0.0
@ -155,6 +87,98 @@
# Pictographic property.
# 01-October-2018: Added the 'Unknown' script name
# 03-October-2018: Added new field for Script Extensions
# ----------------------------------------------------------------------------
#
#
# The main tables generated by this script are used by macros defined in
# pcre2_internal.h. They look up Unicode character properties using short
# sequences of code that contains no branches, which makes for greater speed.
#
# Conceptually, there is a table of records (of type ucd_record), containing a
# script number, script extension value, character type, grapheme break type,
# offset to caseless matching set, offset to the character's other case, for
# every Unicode character. However, a real table covering all Unicode
# characters would be far too big. It can be efficiently compressed by
# observing that many characters have the same record, and many blocks of
# characters (taking 128 characters in a block) have the same set of records as
# other blocks. This leads to a 2-stage lookup process.
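#
# As a rough illustration only (not the script's actual code), the compression
# amounts to the following, where record_index[] notionally holds one record
# index per code point:
#
#   stage1 = []; stage2 = []; seen = {}
#   for n in range(0, MAX_UNICODE, 128):
#     block = tuple(record_index[n:n+128])
#     if block not in seen:
#       seen[block] = len(stage2) // 128    # new "virtual" block number
#       stage2.extend(block)
#     stage1.append(seen[block])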
#
# This script constructs six tables. The ucd_caseless_sets table contains
# lists of characters that all match each other caselessly. Each list is
# in order, and is terminated by NOTACHAR (0xffffffff), which is larger than
# any valid character. The first list is empty; this is used for characters
# that are not part of any list.
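#
# For example, code that wants to know whether characters c and d match each
# other caselessly can walk the set starting at the offset taken from c's
# record (a sketch only; "caseless_offset" and "d" are illustrative names):
#
#   i = caseless_offset                    # offset field from c's ucd_record
#   while ucd_caseless_sets[i] != 0xffffffff:
#     if ucd_caseless_sets[i] == d:
#       break                              # d is in the same caseless set as c
#     i += 1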
#
# The ucd_digit_sets table contains the code points of the '9' characters in
# each set of 10 decimal digits in Unicode. This is used to ensure that digits
# in script runs all come from the same set. The first element in the vector
# contains the number of subsequent elements, which are in ascending order.
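#
# For example, the set that a decimal digit c belongs to can be identified by
# finding the first '9' code point that is not less than c (a sketch only; the
# real checks are done in C during script-run matching):
#
#   def digit_set(c):                      # c must be a decimal digit
#     for i in range(1, ucd_digit_sets[0] + 1):
#       if c <= ucd_digit_sets[i]:
#         return i     # two digits are from the same set if this index matches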
#
# The ucd_script_sets vector contains lists of script numbers that are the
# Script Extensions properties of certain characters. Each list is terminated
# by zero (ucp_Unknown). A character with more than one script listed for its
# Script Extension property has a negative value in its record. This is the
# negated offset to the start of the relevant list in the ucd_script_sets
# vector.
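#
# For example, a test of whether a character may be used with script s could
# look like this (a sketch only; "record_scriptx" stands for the Script
# Extensions field of the character's record):
#
#   def ok_for_script(record_scriptx, s):
#     if record_scriptx >= 0:              # a single script number
#       return record_scriptx == s
#     i = -record_scriptx                  # negated offset into ucd_script_sets
#     while ucd_script_sets[i] != 0:       # 0 (ucp_Unknown) ends the list
#       if ucd_script_sets[i] == s:
#         return True
#       i += 1
#     return False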
#
# The ucd_records table contains one instance of every unique record that is
# required. The ucd_stage1 table is indexed by a character's block number,
# which is the character's code point divided by 128, since 128 is the size
# of each block. The result of a lookup in ucd_stage1 is a "virtual" block
# number.
#
# The ucd_stage2 table is a table of "virtual" blocks; each block is indexed by
# the offset of a character within its own block, and the result is the index
# number of the required record in the ucd_records vector.
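#
# Putting the two stages together, the lookup for a character c is essentially
# the following (a Python rendering of what the C macros in pcre2_internal.h
# effectively do):
#
#   block  = c // 128                      # character's block number
#   vblock = ucd_stage1[block]             # "virtual" block number
#   record = ucd_records[ucd_stage2[vblock * 128 + (c % 128)]]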
#
# The following examples are correct for the Unicode 11.0.0 database. Future
# updates may change the actual lookup values.
#
# Example: lowercase "a" (U+0061) is in block 0
# lookup 0 in stage1 table yields 0
# lookup 97 (0x61) in the first table in stage2 yields 17
# record 17 is { 34, 5, 12, 0, -32, 34, 0 }
# 34 = ucp_Latin => Latin script
# 5 = ucp_Ll => Lower case letter
# 12 = ucp_gbOther => Grapheme break property "Other"
# 0 => Not part of a caseless set
# -32 (-0x20) => Other case is U+0041
# 34 = ucp_Latin => No special Script Extension property
# 0 => Dummy value, unused at present
#
# Almost all lowercase Latin characters resolve to the same record. One or two
# are different because they are part of a multi-character caseless set (for
# example, k, K and the Kelvin symbol are such a set).
#
# Example: hiragana letter A (U+3042) is in block 96 (0x60)
# lookup 96 in stage1 table yields 90
# lookup 66 (0x42) in table 90 in stage2 yields 564
# record 564 is { 27, 7, 12, 0, 0, 27, 0 }
# 27 = ucp_Hiragana => Hiragana script
# 7 = ucp_Lo => Other letter
# 12 = ucp_gbOther => Grapheme break property "Other"
# 0 => Not part of a caseless set
# 0 => No other case
# 27 = ucp_Hiragana => No special Script Extension property
# 0 => Dummy value, unused at present
#
# Example: vedic tone karshana (U+1CD0) is in block 57 (0x39)
# lookup 57 in stage1 table yields 55
# lookup 80 (0x50) in table 55 in stage2 yields 458
# record 458 is { 28, 12, 3, 0, 0, -101, 0 }
# 28 = ucp_Inherited => Script inherited from predecessor
# 12 = ucp_Mn => Non-spacing mark
# 3 = ucp_gbExtend => Grapheme break property "Extend"
# 0 => Not part of a caseless set
# 0 => No other case
# -101 => Script Extension list offset = 101
# 0 => Dummy value, unused at present
#
# At offset 101 in the ucd_script_sets vector we find the list 3, 15, 107, 29,
# and terminator 0. This means that this character is expected to be used with
# any of those scripts, which are Bengali, Devanagari, Grantha, and Kannada.
#
# Philip Hazel, 03 July 2008
# Last Updated: 07 October 2018
##############################################################################
@ -175,13 +199,13 @@ def get_other_case(chardata):
  if chardata[1] == 'C' or chardata[1] == 'S':
    return int(chardata[2], 16) - int(chardata[0], 16)
  return 0
# Parse a line of ScriptExtensions.txt
def get_script_extension(chardata):
  this_script_list = list(chardata[1].split(' '))
  if len(this_script_list) == 1:
    return script_abbrevs.index(this_script_list[0])
  script_numbers = []
  for d in this_script_list:
    script_numbers.append(script_abbrevs.index(d))
@ -190,18 +214,18 @@ def get_script_extension(chardata):
  for i in range(1, len(script_lists) - script_numbers_length + 1):
    for j in range(0, script_numbers_length):
      found = True
      if script_lists[i+j] != script_numbers[j]:
        found = False
        break
    if found:
      return -i
  # Not found in existing lists
  return_value = len(script_lists)
  script_lists.extend(script_numbers)
  return -return_value
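
# For example, with the Unicode 11.0.0 data (script numbers as quoted in the
# header comments above), a ScriptExtensions.txt line whose scripts field is
# 'Beng Deva Gran Knda', such as the entry for U+1CD0, behaves like this (the
# code point field, shown here just as '1CD0', is not used by this function):
#
#   get_script_extension(['1CD0', 'Beng Deva Gran Knda'])
#     => interns [3, 15, 107, 29, 0] in script_lists if not already present
#        and returns the negated offset of that sub-list
#
# A field containing a single abbreviation just returns that script's number.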
# Read the whole table in memory, setting/checking the Unicode version
def read_table(file_name, get_value, default_value):
@ -402,7 +426,7 @@ script_names = ['Unknown', 'Arabic', 'Armenian', 'Bengali', 'Bopomofo', 'Braille
'Dogra', 'Gunjala_Gondi', 'Hanifi_Rohingya', 'Makasar', 'Medefaidrin',
'Old_Sogdian', 'Sogdian'
]
script_abbrevs = [
'Zzzz', 'Arab', 'Armn', 'Beng', 'Bopo', 'Brai', 'Bugi', 'Buhd', 'Cans',
'Cher', 'Zyyy', 'Copt', 'Cprt', 'Cyrl', 'Dsrt', 'Deva', 'Ethi', 'Geor',
@ -434,7 +458,7 @@ script_abbrevs = [
'Zanb',
#New for Unicode 11.0.0
'Dogr', 'Gong', 'Rohg', 'Maka', 'Medf', 'Sogo', 'Sogd'
]
category_names = ['Cc', 'Cf', 'Cn', 'Co', 'Cs', 'Ll', 'Lm', 'Lo', 'Lt', 'Lu',
'Mc', 'Me', 'Mn', 'Nd', 'Nl', 'No', 'Pc', 'Pd', 'Pe', 'Pf', 'Pi', 'Po', 'Ps',
@ -499,10 +523,10 @@ scriptx = read_table('Unicode.tables/ScriptExtensions.txt', get_script_extension
for i in range(0, MAX_UNICODE):
  if scriptx[i] == script_abbrevs_default:
    scriptx[i] = script[i]
# With the addition of the new Script Extensions field, we need some padding
# to get the Unicode records up to 12 bytes (multiple of 4). Set a value
# greater than 255 to make the field 16 bits.
padding_dummy = [0] * MAX_UNICODE
@ -690,11 +714,11 @@ for line in file:
  m = re.match(r'([0-9a-fA-F]+)\.\.([0-9a-fA-F]+)\s+;\s+\S+\s+#\s+Nd\s+', line)
  if m is None:
    continue
  first = int(m.group(1),16)
  last = int(m.group(2),16)
  if ((last - first + 1) % 10) != 0:
    print("ERROR: %04x..%04x does not contain a multiple of 10 characters" % (first, last),
      file=sys.stderr)
  while first < last:
    digitsets.append(first + 9)
    first += 10
@ -724,9 +748,9 @@ count = 0
print(" /* 0 */", end='')
for d in script_lists:
print(" %3d," % d, end='')
count += 1
count += 1
if d == 0:
print("\n /* %3d */" % count, end='')
print("\n /* %3d */" % count, end='')
print("\n};\n")
# Output the main UCD tables.

View File

@ -23,11 +23,12 @@ GenerateUtt.py A Python script to generate part of the pcre2_tables.c file
ManyConfigTests A shell script that runs "configure, make, test" a number of
times with different configuration settings.
MultiStage2.py A Python script that generates the file pcre2_ucd.c from five
Unicode data tables, which are themselves downloaded from the
MultiStage2.py A Python script that generates the file pcre2_ucd.c from six
Unicode data files, which are themselves downloaded from the
Unicode web site. Run this script in the "maint" directory.
The generated file contains the tables for a 2-stage lookup
of Unicode properties.
The generated file is written to stdout. It contains the
tables for a 2-stage lookup of Unicode properties, along with
some auxiliary tables.
pcre2_chartables.c.non-standard
This is a set of character tables that came from a Windows
@ -40,14 +41,15 @@ README This file.
Unicode.tables The files in this directory were downloaded from the Unicode
web site. They contain information about Unicode characters
and scripts. The ones used by the MultiStage2.py script are
CaseFolding.txt, DerivedGeneralCategory.txt, Scripts.txt,
GraphemeBreakProperty.txt, and emoji-data.txt. I've kept
UnicodeData.txt (which is no longer used by the script)
because it is useful occasionally for manually looking up the
details of certain characters. However, note that character
names in this file such as "Arabic sign sanah" do NOT mean
that the character is in a particular script (in this case,
Arabic). Scripts.txt is where to look for script information.
CaseFolding.txt, DerivedGeneralCategory.txt, Scripts.txt,
ScriptExtensions.txt, GraphemeBreakProperty.txt, and
emoji-data.txt. I've kept UnicodeData.txt (which is no longer
used by the script) because it is useful occasionally for
manually looking up the details of certain characters.
However, note that character names in this file such as
"Arabic sign sanah" do NOT mean that the character is in a
particular script (in this case, Arabic). Scripts.txt and
ScriptExtensions.txt are where to look for script information.
ucptest.c A short C program for testing the Unicode property macros
that do lookups in the pcre2_ucd.c data, mainly useful after
@ -61,7 +63,7 @@ utf8.c A short, freestanding C program for converting a Unicode code
point into a sequence of bytes in the UTF-8 encoding, and vice
versa. If its argument is a hex number such as 0x1234, it
outputs a list of the equivalent UTF-8 bytes. If its argument
is sequence of concatenated UTF-8 bytes (e.g. e188b4) it
is a sequence of concatenated UTF-8 bytes (e.g. e188b4) it
treats them as a UTF-8 character and outputs the equivalent
code point in hex.
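
For example, the three-byte encoding used for U+1234 can be sketched in
Python (this illustrates the arithmetic only, not the utf8.c source):

    c = 0x1234
    utf8 = bytes([0xE0 | (c >> 12),          # leading byte of a 3-byte form
                  0x80 | ((c >> 6) & 0x3F),  # continuation bytes carry 6 bits
                  0x80 | (c & 0x3F)])
    print(utf8.hex())                        # prints e188b4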
@ -72,25 +74,31 @@ Updating to a new Unicode release
When there is a new release of Unicode, the files in Unicode.tables must be
refreshed from the web site. If the new version of Unicode adds new character
scripts, the source file pcre2_ucp.h and both the MultiStage2.py and the
GenerateUtt.py scripts must be edited to add the new names. Then MultiStage2.py
can be run to generate a new version of pcre2_ucd.c, and GenerateUtt.py can be
run to generate the tricky tables for inclusion in pcre2_tables.c.
GenerateUtt.py scripts must be edited to add the new names. I have been adding
each new group at the end of the relevant list, with a comment. Note also that
both the pcre2syntax.3 and pcre2pattern.3 man pages contain lists of Unicode
script names.
If MultiStage2.py gives the error "ValueError: list.index(x): x not in list",
the cause is usually a missing (or misspelt) name in the list of scripts. I
couldn't find a straightforward list of scripts on the Unicode site, but
there's a useful Wikipedia page that lists them, and notes the Unicode version
in which they were introduced:
MultiStage2.py has two lists: the full names and the abbreviations that are
found in the ScriptExtensions.txt file. A list of script names and their
abbreviations can be found in the PropertyValueAliases.txt file on the
Unicode web site. There is also a Wikipedia page that lists them, and notes the
Unicode version in which they were introduced:
http://en.wikipedia.org/wiki/Unicode_scripts#Table_of_Unicode_scripts
Once the script name lists have been updated, MultiStage2.py can be run to
generate a new version of pcre2_ucd.c, and GenerateUtt.py can be run to
generate the tricky tables for inclusion in pcre2_tables.c (which must be
hand-edited). If MultiStage2.py gives the error "ValueError: list.index(x): x
not in list", the cause is usually a missing (or misspelt) name in one of the
lists of scripts.
The ucptest program can be compiled and used to check that the new tables in
pcre2_ucd.c work properly, using the data files in ucptestdata to check a
number of test characters. The source file ucptest.c must be updated whenever
new Unicode script names are added.
Note also that both the pcre2syntax.3 and pcre2pattern.3 man pages contain
lists of Unicode script names.
number of test characters. The source file ucptest.c should also be updated
whenever new Unicode script names are added, and adding a few tests for new
scripts is a good idea.
Preparing for a PCRE2 release
@ -401,26 +409,6 @@ very sensible; some are rather wacky. Some have been on this list for years.
strings, at least one of which must be present for a match, efficient
pre-searching of large datasets could be implemented.
. There's a Perl proposal for some new (* things, including alpha synonyms for
the lookaround assertions:
(*pla: …)
(*plb: …)
(*nla: …)
(*nlb: …)
(*atomic: …)
(*positive_look_ahead:...)
(*negative_look_ahead:...)
(*positive_look_behind:...)
(*negative_look_behind:...)
Also a new one (with synonyms):
(*script_run: ...) Ensure all captured chars are in the same script
(*sr: …)
(*atomic_script_run: …) A combination of script_run and atomic
(*asr:...)
. If pcre2grep had --first-line (match only in the first line) it could be
efficiently used to find files "starting with xxx". What about --last-line?
@ -441,4 +429,4 @@ very sensible; some are rather wacky. Some have been on this list for years.
Philip Hazel
Email local part: ph10
Email domain: cam.ac.uk
Last updated: 21 August 2018
Last updated: 07 October 2018