Canonicalizing Human Readable Identifiers
This is not an urgent problem, since we will only employ password-authenticated key agreement based on zero-knowledge password proofs, which considerably reduces the phishing attack surface.
The C language library utf8proc does some rather basic canonicalization for identifiers, falling well short of the lengthy and complex prescriptions of the Unicode Consortium.
The Unicode Consortium has extensive documentation on confusable identifiers, plus source code for detecting them. They are primarily worried about spoofing of names such as PayPal.
But for identifiers, we convert the original UTF-8 string to Normalization Form C (NFC), and then map homoglyphs to a single glyph with a context-sensitive mapping tool that tries to choose the homoglyph that best suits the script of the surrounding characters.
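As a minimal sketch of the first step only, the NFC conversion can be illustrated with Python's standard `unicodedata` module. This is swapped in purely for illustration; it is not utf8proc, and the helper name `to_nfc` is hypothetical. The homoglyph mapping stage is not shown here.

```python
import unicodedata

def to_nfc(identifier: str) -> str:
    """Return the identifier in Normalization Form C (composed form)."""
    return unicodedata.normalize("NFC", identifier)

# "e" followed by a combining acute accent (U+0301) composes
# to the single code point U+00E9.
decomposed = "cafe\u0301"
composed = to_nfc(decomposed)
print(len(decomposed), len(composed))  # 5 4
```

After this step, two identifiers that differ only in how accented characters are encoded compare equal as strings.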
We apply the following rules to an identifier string:
- Invisible characters are removed.
- All of the innumerable different kinds of Unicode whitespace are mapped to a single whitespace character.
- Leading and trailing whitespace is removed.
- All strings end with a visible character followed by a terminating null. No invisible strings.
- Homoglyphs that look like a numeric are mapped to a numeric if followed or preceded by a numeric.
- Homoglyphs that look like a member of a script are mapped to that script if followed or preceded by a member of that script. Preceding characters take precedence, except that numerics take precedence over Latin, and Latin takes precedence over other scripts.
- Isolated homoglyphs, that is, homoglyphs that do not look like any member of the scripts of the immediately preceding or following characters, are mapped to the first script preceding them that provides a match, or, failing that, the first script following them that provides a match, or, failing that, to the homoglyph with the lowest numeric value, which is usually the most vanilla homoglyph.
If these rules result in any changes, the rule set is reapplied until no further changes ensue.
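The following Python sketch illustrates some of these rules and the reapply-until-stable loop. It rests on several assumptions that are not part of the rules themselves: the table `DIGIT_LOOKALIKES` is a hypothetical, deliberately tiny stand-in for a real homoglyph table, "invisible" is approximated by the Unicode categories Cf and Cc, runs of whitespace are not collapsed, and the script-based rules are omitted entirely. All function names are illustrative.

```python
import unicodedata

# Hypothetical toy table: letters that resemble ASCII digits.
DIGIT_LOOKALIKES = {"O": "0", "o": "0", "l": "1", "I": "1"}

def is_invisible(ch: str) -> bool:
    # Assumption: treat format (Cf) and control (Cc) characters as invisible,
    # except those Python already classifies as whitespace (tab, newline, ...).
    return unicodedata.category(ch) in ("Cf", "Cc") and not ch.isspace()

def is_digit(ch: str) -> bool:
    return "0" <= ch <= "9"

def apply_rules_once(s: str) -> str:
    s = unicodedata.normalize("NFC", s)
    # Remove invisible characters.
    s = "".join(ch for ch in s if not is_invisible(ch))
    # Map every kind of whitespace to U+0020 (runs are left uncollapsed).
    s = "".join(" " if ch.isspace() else ch for ch in s)
    # Remove leading and trailing whitespace.
    s = s.strip(" ")
    # Map digit lookalikes to digits when a neighbour is a digit.
    # Neighbours are read from the input of this pass, so chains such as
    # "1OO" need more than one pass to settle.
    out = []
    for i, ch in enumerate(s):
        if ch in DIGIT_LOOKALIKES:
            prev_digit = i > 0 and is_digit(s[i - 1])
            next_digit = i + 1 < len(s) and is_digit(s[i + 1])
            if prev_digit or next_digit:
                ch = DIGIT_LOOKALIKES[ch]
        out.append(ch)
    return "".join(out)

def canonicalize(s: str) -> str:
    # Reapply the rule set until a pass makes no further change.
    while True:
        t = apply_rules_once(s)
        if t == s:
            break
        s = t
    if not s:
        raise ValueError("identifier has no visible characters")
    return s

print(canonicalize("\u200b 1OO\u00a0"))  # 100
```

In this example the first pass yields `10O` and the second yields `100`, which shows why a single pass is not enough. Similarly, removing a zero-width joiner between a letter and a combining accent leaves a sequence that only composes under NFC on the next pass.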
These documents are licensed under the Creative Commons Attribution-Share Alike 3.0 License