
<!DOCTYPE html>
<html lang="en"><head>

  <meta http-equiv="Content-Type" content="text/html; charset=UTF-8">

	<style>
		body {
			max-width: 30em;
			margin-left: 2em;
			}
		p.center {text-align:center;}
	</style>
	<link rel="shortcut icon" href="../rho.ico">
	<title>Canonicalizing Human Readable Identifiers</title> </head><body>

  <p><a href="./index.html"> To Home page</a> </p>

<h1>Canonicalizing Human Readable Identifiers</h1><p>

This is not an urgent problem, since we will only employ password-authenticated key agreement based on zero-knowledge password proofs, which considerably reduces the phishing attack surface.&nbsp; </p><p>

The C language library <a href="http://www.flexiguided.de/publications.utf8proc.en.html">utf8proc</a> does some rather basic canonicalization for identifiers, falling well short of the lengthy and complex prescriptions of the Unicode Consortium.</p><p>

The Unicode Consortium has <a href="https://www.unicode.org/reports/tr39/#UTR36">extensive</a> <a href="https://www.unicode.org/reports/tr39/">documentation</a> on confusable identifiers, plus <a href="http://unicode.org/cldr/utility/confusables.jsp">source code for detecting them</a>.&nbsp; They are primarily worried about spoofing PayPal.&nbsp; </p><p>

For identifiers, however, we convert the original UTF-8 string to <a href="https://www.unicode.org/reports/tr15/tr15-45.html">Normalization Form C (NFC)</a>, and <a href="https://www.unicode.org/reports/tr39/">then map homoglyphs to a single glyph with a context-sensitive mapping tool that tries to choose the homoglyph best suited to the script of the surrounding characters</a>.&nbsp;</p><p>
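
The NFC step alone might be sketched as follows, using Python's standard <code>unicodedata</code> module rather than utf8proc. The function name <code>to_nfc</code> is hypothetical, and the homoglyph mapping is not attempted here.</p>

```python
import unicodedata

def to_nfc(identifier_bytes: bytes) -> str:
    """Decode UTF-8 input and return it in Normalization Form C."""
    # decode() raises UnicodeDecodeError on malformed UTF-8 input.
    text = identifier_bytes.decode("utf-8")
    return unicodedata.normalize("NFC", text)

# "e" followed by COMBINING ACUTE ACCENT (U+0301) composes to the single
# code point U+00E9, so both spellings yield the same identifier.
decomposed = "cafe\u0301".encode("utf-8")
precomposed = "caf\u00e9".encode("utf-8")
assert to_nfc(decomposed) == to_nfc(precomposed) == "caf\u00e9"
```

<p>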

We apply the following rules to an identifier string:&nbsp; </p><ul><li>
Invisible characters removed.&nbsp; </li><li>
All of the innumerable different kinds of Unicode whitespace are mapped to a single white space.&nbsp; </li><li>
Leading and trailing whitespace removed.&nbsp; </li><li>
All strings end with a visible character followed by a terminating null.&nbsp; No invisible strings.</li><li>
Homoglyphs that look like a numeric are mapped to a numeric if followed or preceded by a numeric.&nbsp; </li><li>
Homoglyphs that look like a member of a script are mapped to that script if followed or preceded by a member of that script.&nbsp; Preceding characters take precedence, except that numerics take precedence over Latin, and Latin takes precedence over other scripts.</li><li>
Isolated homoglyphs, homoglyphs that do not look like any member of the scripts of the preceding or following characters, are mapped to the first script preceding them that provides a match, or, failing that, the first script following them that provides a match, or, failing that, to the homoglyph with the lowest numeric value, which is usually the most vanilla homoglyph.&nbsp; </li></ul><p>
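The first four rules above could be sketched as below. This is only an illustration under stated assumptions: "invisible" is taken to mean Unicode general categories Cc and Cf, runs of whitespace are collapsed to one space, and the name <code>clean_whitespace_and_invisibles</code> is hypothetical. The terminating null belongs to the C representation and has no counterpart in a Python string.</p>

```python
import unicodedata

# Assumption: "invisible" means general categories Cc (control) and Cf (format),
# which covers zero-width spaces and joiners, the byte order mark, and soft hyphens.
INVISIBLE_CATEGORIES = {"Cc", "Cf"}

def clean_whitespace_and_invisibles(text: str) -> str:
    """Apply the first four rules; raise ValueError if nothing visible remains."""
    out = []
    for ch in text:
        if ch.isspace():
            # Every kind of Unicode whitespace becomes U+0020. Leading
            # whitespace is skipped, and runs collapse to one space
            # (collapsing is an assumption of this sketch).
            if out and out[-1] != " ":
                out.append(" ")
        elif unicodedata.category(ch) not in INVISIBLE_CATEGORIES:
            out.append(ch)
    # At most one trailing space can remain at this point; remove it.
    cleaned = "".join(out).strip(" ")
    if not cleaned:
        raise ValueError("identifier has no visible characters")
    return cleaned
```

<p>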
If these rules result in any changes, the rule set is reapplied until no further changes ensue.&nbsp; </p>
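<p>Reapplying the rule set until nothing changes is a fixed-point iteration. The sketch below pairs it with a toy version of the numeric homoglyph rule. The table <code>DIGIT_LOOKALIKES</code>, the function names, and the <code>max_rounds</code> guard are all assumptions made for illustration; a real table would be derived from the Unicode confusables data.</p>

```python
from typing import Callable, Iterable

# Hypothetical toy table of letters that resemble ASCII digits.
DIGIT_LOOKALIKES = {"O": "0", "o": "0", "l": "1", "I": "1"}
DIGITS = set("0123456789")

def map_digit_lookalikes(text: str) -> str:
    """Map a lookalike to its digit when an adjacent character is a digit."""
    chars = list(text)
    for i, ch in enumerate(chars):
        # Zero, one, or two neighbours, depending on position in the string.
        neighbours = chars[max(i - 1, 0):i] + chars[i + 1:i + 2]
        if ch in DIGIT_LOOKALIKES and DIGITS.intersection(neighbours):
            chars[i] = DIGIT_LOOKALIKES[ch]
    return "".join(chars)

def canonicalize(text: str,
                 rules: Iterable[Callable[[str], str]],
                 max_rounds: int = 100) -> str:
    """Reapply every rule in order until a full round changes nothing."""
    rules = list(rules)
    for _ in range(max_rounds):
        previous = text
        for rule in rules:
            text = rule(text)
        if text == previous:
            return text
    # Guard against rule sets that oscillate instead of converging.
    raise RuntimeError("rules did not reach a fixed point")
```

<p>One left-to-right pass over <code>OO1</code> changes only the letter next to the digit, so a second round is needed before the string settles on <code>001</code>. That is the situation the reapplication rule is meant to cover.</p>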
<p style="background-color : #ccffcc;  font-size:80%">These documents are
licensed under the <a rel="license" href="http://creativecommons.org/licenses/by-sa/3.0/">Creative
Commons Attribution-Share Alike 3.0 License</a></p>

 </body></html>