Normalizing Unicode strings

I would like strings that look similar to humans to map to the same item. Obviously, leading and trailing whitespace needs to go, and each run of whitespace should map to a single space.
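The whitespace part is the easy bit. A minimal sketch in Python, where the function name `collapse_whitespace` is made up for illustration and "whitespace" is assumed to mean whatever Python's `str.split()` treats as whitespace (which includes Unicode spaces such as the no-break space):

```python
def collapse_whitespace(text: str) -> str:
    """Strip leading/trailing whitespace and squeeze inner runs to one space.

    Hypothetical helper, not taken from any existing code base.
    """
    # str.split() with no arguments splits on runs of Unicode whitespace
    # and drops empty leading and trailing fields.
    return " ".join(text.split())


if __name__ == "__main__":
    print(repr(collapse_whitespace("  hello \t\n  world  ")))
```

This handles only whitespace; it does nothing about near-duplicate symbols.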

The hard part, however, is that Unicode has an enormous number of near-duplicate symbols.

Have you already read
https://www.unicode.org/reports/tr15/tr15-45.html ?

Our normalization code is in
http://www.openldap.org/devel/gitweb.cgi?p=openldap.git;a=tree;f=libraries/liblunicode;h=4896a6dc9ee5d3e78c15ed6c2e2ed2f21be70247;hb=HEAD

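I am going to have to use normalization form NFKC for the key, and normalization form NFC for the display of the key. NFKC folds compatibility variants such as ligatures and fullwidth letters into their plain equivalents, so look-alike strings get the same key, while NFC only composes characters and otherwise keeps what was typed.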
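A rough sketch of that split between key and display, using Python's standard `unicodedata` module. The function names `lookup_key` and `display_form` are hypothetical, and normalizing before collapsing whitespace is my assumption about a sensible order (some compatibility characters decompose into sequences containing a space):

```python
import unicodedata


def _collapse_whitespace(text: str) -> str:
    # Strip the ends and squeeze each whitespace run to a single space.
    return " ".join(text.split())


def lookup_key(text: str) -> str:
    """Key used for comparison: NFKC, then whitespace collapsed."""
    return _collapse_whitespace(unicodedata.normalize("NFKC", text))


def display_form(text: str) -> str:
    """Form shown to the user: NFC, then whitespace collapsed."""
    return _collapse_whitespace(unicodedata.normalize("NFC", text))


if __name__ == "__main__":
    # U+FB01 is the "fi" ligature; the second sample is fullwidth "file".
    for sample in ["\ufb01le", "\uff46\uff49\uff4c\uff45", " file "]:
        print(repr(display_form(sample)), "->", repr(lookup_key(sample)))
```

All three samples should end up with the same key while keeping distinct display forms, which is exactly the situation where two names that look different on screen turn out to be the same item.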

This will, once in a blue moon, drive someone crazy. "It's broken," he will say.

These documents are licensed under the Creative Commons Attribution-Share Alike 3.0 License