>  the future of publishing at W3C
That is an amazing example.
It's not even "double UTF-8": it's UTF-8 six times (including the one to get it on the Web). It's been decoded as Latin-1 twice and Windows-1252 three times, and at the end there's a non-breaking space that's been converted to a space. All to represent what originated as a single non-breaking space anyway.
Which makes me happy that my module solves it.
>>> from ftfy.fixes import fix_encoding_and_explain
>>> fix_encoding_and_explain(" the future of publishing at W3C")
('\xa0the future of publishing at W3C',
 [('encode', 'sloppy-windows-1252', 0),
  ('transcode', 'restore_byte_a0', 2),
  ('decode', 'utf-8-variants', 0),
  ('encode', 'sloppy-windows-1252', 0),
  ('decode', 'utf-8', 0),
  ('encode', 'latin-1', 0),
  ('decode', 'utf-8', 0),
  ('encode', 'sloppy-windows-1252', 0),
  ('decode', 'utf-8', 0),
  ('encode', 'latin-1', 0),
  ('decode', 'utf-8', 0)])
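For anyone who wants to see how the layers pile up, here is a rough sketch that replays the damage using only standard-library codecs and then peels it off again. The order of wrong decodings is read off the explanation above, backwards. The names `garble`, `ungarble` and `WRONG_DECODINGS` are made up for this illustration and are not part of ftfy. Two simplifications: it uses Python's strict `cp1252` instead of ftfy's `sloppy-windows-1252` (enough for this string, as far as I can tell, because none of the bytes Windows-1252 leaves undefined show up), and it skips the last step where the trailing non-breaking space became a plain space, which is what `restore_byte_a0` appears to undo.

```python
# Wrong decodings applied after each UTF-8 encode, oldest damage first.
# Assumption: this order is the explanation output read in reverse.
WRONG_DECODINGS = ["latin-1", "cp1252", "latin-1", "cp1252", "cp1252"]


def garble(text, wrong_decodings=WRONG_DECODINGS):
    """Encode as UTF-8 and decode with the wrong codec, once per layer."""
    for codec in wrong_decodings:
        text = text.encode("utf-8").decode(codec)
    return text


def ungarble(text, wrong_decodings=WRONG_DECODINGS):
    """Undo the layers newest first: re-encode with the wrong codec, decode as UTF-8."""
    for codec in reversed(wrong_decodings):
        text = text.encode(codec).decode("utf-8")
    return text


original = "\xa0the future of publishing at W3C"
mangled = garble(original)
restored = ungarble(mangled)

print(repr(mangled))         # the single non-breaking space is now a long run of junk
print(restored == original)  # the round trip recovers the original text
```

The hard part ftfy handles, and this sketch does not, is working out the sequence of codecs from the garbled text alone; here the sequence is simply given.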
Neato! I wrote a shitty version of 50% of that two years ago, when I was tasked with uncooking a bunch of data in a MySQL database as part of a larger migration to UTF-8. I hadn't done that much pencil-and-paper bit manipulation since I was 13.