Hacker News

>  the future of publishing at W3C

That is an amazing example.

It's not even "double UTF-8", it's UTF-8 six times (including the one to get it on the Web), it's been decoded as Latin-1 twice and Windows-1252 three times, and at the end there's a non-breaking space that's been converted to a space. All to represent what originated as a single non-breaking space anyway.

Which makes me happy that my module solves it.

    >>> from ftfy.fixes import fix_encoding_and_explain
    >>> fix_encoding_and_explain(" the future of publishing at W3C")
    ('\xa0the future of publishing at W3C',
     [('encode', 'sloppy-windows-1252', 0),
      ('transcode', 'restore_byte_a0', 2),
      ('decode', 'utf-8-variants', 0),
      ('encode', 'sloppy-windows-1252', 0),
      ('decode', 'utf-8', 0),
      ('encode', 'latin-1', 0),
      ('decode', 'utf-8', 0),
      ('encode', 'sloppy-windows-1252', 0),
      ('decode', 'utf-8', 0),
      ('encode', 'latin-1', 0),
      ('decode', 'utf-8', 0)])
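
For anyone who wants to see the layering by hand, here is a minimal sketch using only Python's standard library, not ftfy. The helper names `garble` and `ungarble` are made up for illustration, and it assumes strict Latin-1 at every step; the real example above also mixes in Windows-1252, which is why ftfy needs its sloppier codecs and the `restore_byte_a0` step.

```python
def garble(text, rounds, codec="latin-1"):
    """Simulate mojibake: encode as UTF-8, then misread the bytes as `codec`."""
    for _ in range(rounds):
        text = text.encode("utf-8").decode(codec)
    return text


def ungarble(text, rounds, codec="latin-1"):
    """Undo `garble` by reversing each step: re-encode as `codec`, decode as UTF-8."""
    for _ in range(rounds):
        text = text.encode(codec).decode("utf-8")
    return text


# Start with a single non-breaking space in front of the text.
original = "\xa0the future of publishing at W3C"

# Two rounds of "UTF-8 read as Latin-1" turn that one character into four.
damaged = garble(original, 2)

# Reversing the same number of rounds recovers the original string.
repaired = ungarble(damaged, 2)
```

This only works because the number of rounds and the codec are known in advance. Detecting them from the damaged text alone is the hard part that the module handles.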


Hey, is there any way I could automate this kind of fix? It'd be awesome for web scraping.


Automating this fix is precisely what I'm showing off. And yes, it's damn useful for web scraping.

https://github.com/LuminosoInsight/python-ftfy


Neato! I wrote a shitty version of 50% of that two years ago, when I was tasked with uncooking a bunch of data in a MySQL database as part of a larger migration to UTF-8. I hadn't done that much pencil-and-paper bit manipulation since I was 13.


Awesome module! I wonder if anyone else had ever managed to reverse-engineer that tweet before.



