<!DOCTYPE html>
<html lang="en"><head>
<meta http-equiv="Content-Type" content="text/html; charset=UTF-8">
<style>
body {
max-width: 30em;
margin-left: 2em;
}
p.center {text-align:center;}
</style>
<link rel="shortcut icon" href="../rho.ico">
<title>Canonicalizing Human Readable Identifiers</title> </head><body>
<p><a href="./index.html"> To Home page</a> </p>
<h1>Canonicalizing Human Readable Identifiers</h1><p>
This is not an urgent problem, since we will only employ password-authenticated key agreement based on zero-knowledge password proofs, considerably reducing the phishing attack surface. </p><p>
The C language library <a href="http://www.flexiguided.de/publications.utf8proc.en.html">utf8proc</a> does some rather basic canonicalization for identifiers, falling well short of the lengthy and complex prescriptions of the Unicode Consortium.</p><p>
The Unicode Consortium has <a href="https://www.unicode.org/reports/tr39/#UTR36">extensive</a> <a href="https://www.unicode.org/reports/tr39/">documentation</a> on confusable identifiers, plus <a href="http://unicode.org/cldr/utility/confusables.jsp">source code for detecting them</a>. They are primarily worried about spoofing PayPal. </p><p>
For identifiers, we convert the original UTF-8 string to <a href="https://www.unicode.org/reports/tr15/tr15-45.html">Normalization Form C (NFC)</a>, and <a href="https://www.unicode.org/reports/tr39/">then map homoglyphs to a single glyph with a context-sensitive mapping tool that tries to choose the homoglyph that best suits the script of the surrounding characters</a>. </p><p>
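The NFC step on its own can be sketched in a few lines of Python using the standard <code>unicodedata</code> module. This is only an illustration under stated assumptions: the function name <code>nfc_utf8</code> is hypothetical, and the sketch does not touch homoglyphs at all, since NFC composes combining sequences but leaves look-alike characters from different scripts distinct. </p>

```python
import unicodedata

def nfc_utf8(raw: bytes) -> bytes:
    """Decode UTF-8, apply Normalization Form C, and re-encode as UTF-8.

    The name nfc_utf8 is hypothetical, chosen for this illustration.
    """
    text = raw.decode("utf-8")
    return unicodedata.normalize("NFC", text).encode("utf-8")

# "e" followed by U+0301 COMBINING ACUTE ACCENT composes to U+00E9.
decomposed = "cafe\u0301".encode("utf-8")
composed = nfc_utf8(decomposed)

# NFC does not merge homoglyphs: Cyrillic U+0430 stays distinct from Latin "a".
cyrillic_a = nfc_utf8("\u0430".encode("utf-8"))
```

<p>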
We apply the following rules to an identifier string: </p><ul><li>
Invisible characters removed. </li><li>
All of the innumerable different kinds of Unicode whitespace are mapped to a single whitespace character. </li><li>
Leading and trailing whitespace removed. </li><li>
All strings end with a visible character followed by a terminating null. No invisible strings.</li><li>
Homoglyphs that look like a numeric are mapped to a numeric if followed or preceded by a numeric. </li><li>
Homoglyphs that look like a member of a script are mapped to that script if followed or preceded by a member of that script. Preceding characters take precedence, except that numerics take precedence over Latin, and Latin takes precedence over other scripts.</li><li>
Isolated homoglyphs, that is, homoglyphs that do not look like any member of the scripts of the preceding or following characters, are mapped to the first script preceding them that provides a match, or, failing that, the first script following them that provides a match, or, failing that, to the homoglyph with the lowest numeric value, which is usually the most vanilla homoglyph. </li></ul><p>
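The first four rules, which do not involve homoglyphs, can be sketched as follows. This is a minimal sketch resting on assumptions that the rules above do not spell out: "invisible" is taken to mean the Unicode general categories Cf (format) and Cc (control), whitespace is whatever Python's <code>str.isspace</code> reports, each whitespace character becomes U+0020 without collapsing runs, and the name <code>clean_identifier</code> is hypothetical. The terminating null is a matter for the C-side string representation and is not modelled here. </p>

```python
import unicodedata

# Assumption: "invisible" means general categories Cf (format) and Cc (control).
INVISIBLE_CATEGORIES = {"Cf", "Cc"}

def clean_identifier(identifier: str) -> str:
    """Apply the whitespace and invisible-character rules to one identifier.

    The name clean_identifier is hypothetical, chosen for this illustration.
    """
    out = []
    for ch in identifier:
        if ch.isspace():
            # Every kind of whitespace (tabs, no-break space, ...) becomes U+0020.
            out.append(" ")
        elif unicodedata.category(ch) in INVISIBLE_CATEGORIES:
            # Invisible characters such as U+200B ZERO WIDTH SPACE are dropped.
            continue
        else:
            out.append(ch)
    # Leading and trailing whitespace removed.
    cleaned = "".join(out).strip(" ")
    # No invisible strings: something visible must remain.
    if not cleaned:
        raise ValueError("identifier has no visible characters")
    return cleaned

# No-break space, zero width space, ideographic space and a tab in one input.
example = clean_identifier("\u00a0al\u200bice\u3000bob\t")
```

<p>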
If these rules result in any changes, the rule set is reapplied until no further changes ensue. </p>
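<p>Reapplying the rules until nothing changes is a fixed-point iteration. The sketch below shows why more than one pass can be needed, using only the numeric rule and a tiny look-alike table. Both the table <code>DIGIT_LOOKALIKES</code> and the function names are hypothetical, chosen for illustration; a real table would be derived from the Unicode confusables data. </p>

```python
# Hypothetical, deliberately tiny look-alike table (an assumption for this sketch).
DIGIT_LOOKALIKES = {"O": "0", "o": "0", "l": "1", "I": "1"}
ASCII_DIGITS = "0123456789"

def map_digit_lookalikes(s: str) -> str:
    """One pass: map a look-alike to a digit if a neighbour in the input is a digit."""
    out = []
    for i, ch in enumerate(s):
        prev_is_digit = i != 0 and s[i - 1] in ASCII_DIGITS
        next_is_digit = i + 1 != len(s) and s[i + 1] in ASCII_DIGITS
        if ch in DIGIT_LOOKALIKES and (prev_is_digit or next_is_digit):
            out.append(DIGIT_LOOKALIKES[ch])
        else:
            out.append(ch)
    return "".join(out)

def canonicalize(s: str, rules) -> str:
    """Reapply every rule in order until a full round changes nothing."""
    while True:
        before = s
        for rule in rules:
            s = rule(s)
        if s == before:
            return s

# One pass only reaches the first "O"; the second needs another round.
one_pass = map_digit_lookalikes("1OO")
fixed_point = canonicalize("1OO", [map_digit_lookalikes])
```

<p>In this sketch the loop is guaranteed to terminate because the single rule only ever turns letters into digits. Whether the full rule set always reaches a fixed point depends on how the script rules interact, which this example does not attempt to show. </p>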
<p style="background-color : #ccffcc; font-size:80%">These documents are
licensed under the <a rel="license" href="http://creativecommons.org/licenses/by-sa/3.0/">Creative
Commons Attribution-Share Alike 3.0 License</a></p>
</body></html>