<h1>Canonicalizing Human Readable Identifiers</h1><p>
This is not an urgent problem, since we will only employ password-authenticated key agreement based on zero-knowledge password proofs, which considerably reduces the phishing attack surface. </p><p>
The C language library <a href="http://www.flexiguided.de/publications.utf8proc.en.html">utf8proc</a> does some rather basic canonicalization for identifiers, falling well short of the lengthy and complex prescriptions of the Unicode Consortium.</p><p>
The Unicode Consortium has <a href="https://www.unicode.org/reports/tr39/#UTR36">extensive</a> <a href="https://www.unicode.org/reports/tr39/">documentation</a> on confusable identifiers, plus <a href="http://unicode.org/cldr/utility/confusables.jsp">source code for detecting them</a>. They are primarily worried about spoofing PayPal. </p><p>
But for identifiers, we convert the original UTF-8 string to <a href="https://www.unicode.org/reports/tr15/tr15-45.html">Normalization Form C (NFC)</a>, and <a href="https://www.unicode.org/reports/tr39/">then map homoglyphs to a single glyph with a context-sensitive mapping tool that tries to choose the homoglyph that best suits the script of the surrounding characters</a>. </p><p>
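The NFC step can be sketched in Python with the standard <code>unicodedata</code> module. This is only an illustration, assuming the identifier arrives as UTF-8 bytes; the function name <code>to_nfc</code> is hypothetical and is not part of utf8proc or of any Unicode Consortium tool: </p>

```python
import unicodedata

def to_nfc(raw: bytes) -> str:
    """Decode a UTF-8 identifier and return it in Normalization Form C."""
    return unicodedata.normalize("NFC", raw.decode("utf-8"))

# "e" followed by U+0301 COMBINING ACUTE ACCENT composes into U+00E9.
decomposed = "Cafe\u0301".encode("utf-8")
composed = to_nfc(decomposed)
```

<p>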
We apply the following rules to an identifier string: </p><ul><li>
Invisible characters removed. </li><li>
All of the innumerable different kinds of Unicode whitespace are mapped to a single white space. </li><li>
Leading and trailing whitespace removed. </li><li>
All strings end with a visible character followed by a terminating null. No invisible strings.</li><li>
Homoglyphs that look like a numeric are mapped to a numeric if followed or preceded by a numeric. </li><li>
Homoglyphs that look like a member of a script are mapped to that script if followed or preceded by a member of that script. Preceding characters take precedence, except that numerics take precedence over Latin, and Latin takes precedence over other scripts.</li><li>
Isolated homoglyphs, homoglyphs that do not look like any member of the scripts of the preceding or following characters, are mapped to the first script preceding them that provides a match, or, failing that, the first script following them that provides a match, or, failing that, to the homoglyph with the lowest numeric value, which is usually the most vanilla homoglyph. </li></ul><p>
If these rules result in any changes, the rule set is reapplied until no further changes ensue. </p>
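<p>A minimal sketch of a subset of these rules, and of reapplying them until nothing changes, might look like the following. It is an illustration under stated assumptions, not the actual mapping tool: the table <code>DIGIT_LOOKALIKES</code> is a hypothetical four-entry stand-in for real confusables data, "invisible" is assumed to mean Unicode categories Cf and Cc, only the numeric homoglyph rule is implemented, and all function names are invented for the example.</p>

```python
import unicodedata

# Hypothetical miniature table: Latin letters that resemble digits.
# A real implementation would draw on the Unicode confusables data.
DIGIT_LOOKALIKES = {"O": "0", "o": "0", "l": "1", "I": "1"}
ASCII_DIGITS = set("0123456789")

def is_invisible(ch: str) -> bool:
    # Assumption of this sketch: format (Cf) and control (Cc) characters
    # that are not whitespace count as invisible.
    return unicodedata.category(ch) in ("Cf", "Cc") and not ch.isspace()

def apply_rules_once(s: str) -> str:
    # Remove invisible characters.
    s = "".join(ch for ch in s if not is_invisible(ch))
    # Map every kind of Unicode whitespace to a plain space.
    s = "".join(" " if ch.isspace() else ch for ch in s)
    # Remove leading and trailing whitespace.
    s = s.strip(" ")
    # Map digit lookalikes to digits when a neighbour is a digit.
    # Neighbours are read from the string as it stood at the start of
    # this step, so a chain of lookalikes needs several passes.
    out = []
    for i, ch in enumerate(s):
        prev_ch = s[max(i - 1, 0):i]   # "" at the start of the string
        next_ch = s[i + 1:i + 2]       # "" at the end of the string
        near_digit = prev_ch in ASCII_DIGITS or next_ch in ASCII_DIGITS
        if ch in DIGIT_LOOKALIKES and near_digit:
            out.append(DIGIT_LOOKALIKES[ch])
        else:
            out.append(ch)
    return "".join(out)

def canonicalize(s: str) -> str:
    s = unicodedata.normalize("NFC", s)
    # Reapply the rule set until no further changes ensue.
    while True:
        result = apply_rules_once(s)
        if result == s:
            break
        s = result
    if not result:
        # No invisible strings: nothing visible survived.
        raise ValueError("identifier has no visible characters")
    return result
```

<p>In this sketch the input <code>OO1</code> needs two passes: the first turns the middle letter into a digit, and only then does the leading letter have a numeric neighbour. That is the kind of case the reapplication step exists for.</p>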
<p style="background-color : #ccffcc; font-size:80%">These documents are
licensed under the <a rel="license" href="http://creativecommons.org/licenses/by-sa/3.0/">Creative