
Fowler's Law on Unicode: There's always another bug, you just haven't found it yet.

Dr Drang's script counts the number of _characters_ (code points), not the number of _glyphs_. This matters because there's more than one way to represent é: either as the single Unicode character \x{e9} (the "NFC" form) or as an "e" followed by the combining character \x{301} that adds the accent (the "NFD" form).

For example, for "léon" this prints out "l3n" for me, not "l2n".

What you need to do is normalize to NFC.

> /usr/bin/perl -C -MUnicode::Normalize -pe '$_=NFC($_);s/(.)(.+)(.)/$1 . length($2) . $3/e'
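To make the difference concrete, here is a minimal Python sketch of the same idea. It is not Dr Drang's script: the `abbreviate` helper is a hypothetical stand-in for the Perl substitution above, and the two spellings of "léon" are written out with explicit escapes so the form is unambiguous.

```python
import unicodedata


def abbreviate(word: str) -> str:
    """Hypothetical helper: first character, number of code points
    in between, last character (mirrors s/(.)(.+)(.)/.../e)."""
    if len(word) < 3:
        return word  # the Perl regex would not match, so leave it alone
    return f"{word[0]}{len(word) - 2}{word[-1]}"


composed = "l\u00e9on"     # é as the single code point U+00E9
decomposed = "le\u0301on"  # "e" followed by U+0301 COMBINING ACUTE ACCENT

# The decomposed spelling has one extra code point in the middle.
print(abbreviate(decomposed))                                # l3n
# Normalizing to NFC first folds "e" + U+0301 back into U+00E9.
print(abbreviate(unicodedata.normalize("NFC", decomposed)))  # l2n
```

Both strings render identically, which is why the bug is easy to miss until the input happens to arrive in decomposed form.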



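NFC isn't right, either: some letter-plus-accent combinations don't have pre-composed forms, so they stay as multiple characters even after normalization. IMO, you need to pull in a whole glyph-counting (grapheme cluster) algorithm.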
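A small Python sketch of that failure case, under some loud assumptions: the sample word is made up, and `approx_grapheme_count` is a hypothetical helper that only skips combining marks. It is a rough approximation, not the full Unicode grapheme cluster rules (it ignores things like emoji joiner sequences and Hangul jamo).

```python
import unicodedata


def approx_grapheme_count(text: str) -> int:
    """Hypothetical helper: count code points that are not combining
    marks. A rough approximation only, not full UAX #29 segmentation."""
    return sum(1 for ch in text if unicodedata.combining(ch) == 0)


# "x" + U+0304 COMBINING MACRON has no pre-composed code point,
# so NFC cannot shrink it to a single character.
word = "ax\u0304e"
nfc = unicodedata.normalize("NFC", word)

print(len(nfc))                    # 4 code points, even after NFC
print(approx_grapheme_count(nfc))  # 3 user-visible letters
```

For real input, a proper grapheme cluster segmenter is the safer choice; in Perl, for instance, the `\X` regex escape matches one extended grapheme cluster.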



