Canonicalizing Human Readable Identifiers

This is not an urgent problem, since we will only employ password-authenticated key agreement based on zero-knowledge password proofs, which considerably reduces the phishing attack surface.

The C language library utf8proc does some rather basic canonicalization for identifiers, falling well short of the lengthy and complex prescriptions of the Unicode Consortium.

The Unicode Consortium has extensive documentation on confusable identifiers, plus source code for detecting them. They are primarily worried about spoofing of well-known names such as PayPal.

But for identifiers, we convert the original UTF-8 string to Normalization Form C (NFC), and then map homoglyphs to a single glyph with a context-sensitive mapping tool that tries to choose the homoglyph best suited to the script of the surrounding characters.
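The two steps above can be sketched roughly as follows, in Python, using only the standard library. This is a minimal illustration, not the actual mapping tool. The table `HOMOGLYPH_CLASSES`, the helpers `script_of` and `dominant_script`, and the function `canonicalize_once` are all hypothetical names invented for this example. The table covers only a handful of Latin, Cyrillic and Greek look-alikes. "Context" is approximated by the most common script in the whole identifier rather than by the immediately surrounding characters.

```python
import unicodedata
from collections import Counter

# Hypothetical, deliberately tiny homoglyph table: each entry lists the
# look-alike form of one glyph in several scripts. A real table would be
# far larger (for example, derived from the Unicode confusables data).
HOMOGLYPH_CLASSES = [
    {"LATIN": "a", "CYRILLIC": "\u0430"},
    {"LATIN": "e", "CYRILLIC": "\u0435"},
    {"LATIN": "o", "CYRILLIC": "\u043e", "GREEK": "\u03bf"},
    {"LATIN": "p", "CYRILLIC": "\u0440"},
    {"LATIN": "c", "CYRILLIC": "\u0441"},
]

# Reverse index: character -> the homoglyph class it belongs to.
CLASS_OF = {ch: cls for cls in HOMOGLYPH_CLASSES for ch in cls.values()}


def script_of(ch):
    """Crude script guess: the first word of the Unicode character name."""
    if not ch.isalpha():
        return None
    name = unicodedata.name(ch, "")
    return name.split(" ", 1)[0] if name else None


def dominant_script(text):
    """Most common script among letters with no look-alike in the table,
    falling back to all letters when every letter is ambiguous."""
    unambiguous = [script_of(c) for c in text if c not in CLASS_OF]
    everything = [script_of(c) for c in text]
    for scripts in (unambiguous, everything):
        counts = Counter(s for s in scripts if s)
        if counts:
            return counts.most_common(1)[0][0]
    return None


def canonicalize_once(identifier):
    """NFC-normalize, then map each homoglyph to the dominant script's form."""
    text = unicodedata.normalize("NFC", identifier)
    script = dominant_script(text)
    # Characters outside the table, or with no form in the dominant
    # script, are passed through unchanged.
    return "".join(CLASS_OF.get(c, {}).get(script, c) for c in text)
```

With this sketch, an identifier spelled `p`, Cyrillic `а` (U+0430), `ypal` maps to plain Latin `paypal`, while a mostly Cyrillic identifier containing a stray Latin `c` has that letter mapped to Cyrillic `с` (U+0441).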

We apply the following rules to an identifier string: 

If these rules result in any changes, the rule set is reapplied until no further changes ensue. 
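Reapplying a rule set until nothing changes is a fixed-point iteration, and a small sketch may make the loop concrete. The function name `apply_until_stable`, the `max_passes` safety limit, and the two placeholder rules below are assumptions made for illustration. They are not the rules this document applies to identifiers.

```python
def apply_until_stable(identifier, rules, max_passes=16):
    """Apply every rule in order, repeating until a full pass changes nothing.

    max_passes is an assumed safety limit so that a badly behaved rule set
    cannot loop forever.
    """
    current = identifier
    for _ in range(max_passes):
        updated = current
        for rule in rules:
            updated = rule(updated)
        if updated == current:
            return current  # fixed point reached
        current = updated
    raise ValueError("rule set did not converge")


# Placeholder rules, chosen only to show why a second pass can be needed.
def strip_spaces(s):
    return s.strip()


def strip_underscores(s):
    return s.strip("_")


result = apply_until_stable(" _ name_ ", [strip_spaces, strip_underscores])
```

Here the first pass yields `" name"`, because removing the leading underscore exposes a space that `strip_spaces` has already passed over. The second pass removes that space, and the third pass confirms that nothing further changes.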

These documents are licensed under the Creative Commons Attribution-Share Alike 3.0 License