wallet/docs/canonicalizing_human_readable_identifiers.html
reaction.la 5238cda077
cleanup, and just do not like pdfs
Also, needed to understand Byzantine fault tolerant paxos better.

Still do not.
2022-02-20 18:26:44 +10:00


<!DOCTYPE html>
<html lang="en"><head>
<meta http-equiv="Content-Type" content="text/html; charset=UTF-8">
<style>
body {
max-width: 30em;
margin-left: 2em;
}
p.center {text-align:center;}
</style>
<link rel="shortcut icon" href="../rho.ico">
<title>Canonicalizing Human Readable Identifiers</title> </head><body>
<p><a href="./index.html"> To Home page</a> </p>
<h1>Canonicalizing Human Readable Identifiers</h1><p>
This is not an urgent problem, since we will only employ password-authenticated key agreement based on zero-knowledge password proofs, which considerably reduces the phishing attack surface.&nbsp; </p><p>
The C language library <a href="http://www.flexiguided.de/publications.utf8proc.en.html">utf8proc</a> does some rather basic canonicalization for identifiers, falling well short of the lengthy and complex prescriptions of the Unicode Consortium.</p><p>
The Unicode Consortium has <a href="https://www.unicode.org/reports/tr39/#UTR36">extensive</a> <a href="https://www.unicode.org/reports/tr39/">documentation</a> on confusable identifiers, plus <a href="http://unicode.org/cldr/utility/confusables.jsp">source code for detecting them</a>.&nbsp; They are primarily worried about spoofing of names such as paypal.&nbsp; </p><p>
But for identifiers, we convert the original UTF-8 string to <a href="https://www.unicode.org/reports/tr15/tr15-45.html">Normalization Form C (NFC)</a>, and <a href="https://www.unicode.org/reports/tr39/">then map homoglyphs to a single glyph with a context-sensitive mapping tool that tries to choose the homoglyph that best suits the script of the surrounding characters</a>.&nbsp;</p><p>
We apply the following rules to an identifier string:&nbsp; </p><ul><li>
Invisible characters are removed.&nbsp; </li><li>
All of the innumerable different kinds of Unicode whitespace are mapped to a single whitespace character.&nbsp; </li><li>
Leading and trailing whitespace is removed.&nbsp; </li><li>
All strings end with a visible character followed by a terminating null.&nbsp; There are no invisible strings.</li><li>
Homoglyphs that look like a numeric are mapped to a numeric if followed or preceded by a numeric.&nbsp; </li><li>
Homoglyphs that look like a member of a script are mapped to that script if followed or preceded by a member of that script.&nbsp; Preceding characters take precedence, except that numerics take precedence over Latin, and Latin takes precedence over other scripts.</li><li>
Isolated homoglyphs, homoglyphs that do not look like any member of the scripts of the preceding or following characters, are mapped to the first script preceding them that provides a match, or, failing that, the first script following them that provides a match, or, failing that, to the homoglyph with the lowest numeric value, which is usually the most vanilla homoglyph.&nbsp; </li></ul><p>
If these rules result in any changes, the rule set is reapplied until no further changes ensue.&nbsp; </p>
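<p>The rules above, and the reapplication until nothing changes, might be sketched roughly as follows.&nbsp; This is only an illustration in Python, not the wallet's implementation.&nbsp; The names <code>DIGIT_LOOKALIKES</code>, <code>INVISIBLE_CATEGORIES</code>, <code>one_pass</code> and <code>canonicalize</code> are hypothetical, the three-entry homoglyph table is a stand-in for a real table built from the Unicode confusables data, and treating the general categories Cc and Cf as "invisible" is an assumption.&nbsp; Only the numeric homoglyph rule is shown; the script-sensitive rules are omitted.&nbsp; </p>

```python
import unicodedata

# Hypothetical, deliberately tiny table of characters that look like digits.
# A real table would be derived from the Unicode confusables data.
DIGIT_LOOKALIKES = {"O": "0", "l": "1", "I": "1"}
ASCII_DIGITS = "0123456789"

# Assumption: "invisible" means control (Cc) and format (Cf) characters
# that are not themselves whitespace.
INVISIBLE_CATEGORIES = ("Cc", "Cf")


def one_pass(s):
    """Apply each rule once, in order."""
    # Normalization Form C.
    s = unicodedata.normalize("NFC", s)
    # Remove invisible characters, but keep whitespace for the next rule.
    visible = [
        ch for ch in s
        if ch.isspace() or unicodedata.category(ch) not in INVISIBLE_CATEGORIES
    ]
    # Map every kind of whitespace to U+0020, then trim both ends.
    # Runs of whitespace are not collapsed, since the rules do not say so.
    s = "".join(" " if ch.isspace() else ch for ch in visible).strip(" ")
    # Map digit lookalikes to digits when a digit is adjacent.
    # Padding with spaces avoids special cases at the ends of the string.
    padded = " " + s + " "
    chars = list(s)
    for i, ch in enumerate(s):
        if ch in DIGIT_LOOKALIKES:
            before, after = padded[i], padded[i + 2]
            if before in ASCII_DIGITS or after in ASCII_DIGITS:
                chars[i] = DIGIT_LOOKALIKES[ch]
    return "".join(chars)


def canonicalize(identifier):
    """Reapply the rule set until no further changes ensue."""
    current = identifier
    while True:
        changed = one_pass(current)
        if changed == current:
            break
        current = changed
    # No invisible strings: the result must contain a visible character.
    # (Python strings carry no terminating null, so that part is omitted.)
    if not current:
        raise ValueError("identifier has no visible characters")
    return current


print(canonicalize("  pay\u200bpal\u3000lO1 "))  # paypal 101
```

<p>In the example, the first pass turns <code>lO1</code> into <code>l01</code>, because only the <code>O</code> has a digit beside it.&nbsp; The second pass then sees a digit after the <code>l</code> and produces <code>101</code>, which is why the rule set has to be reapplied until it reaches a fixed point.&nbsp; </p>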
<p style="background-color : #ccffcc; font-size:80%">These documents are
licensed under the <a rel="license" href="http://creativecommons.org/licenses/by-sa/3.0/">Creative
Commons Attribution-Share Alike 3.0 License</a></p>
</body></html>