encoding_repair

Helpers to repair encodings (especially umlauts)

Project description

Special characters such as umlauts very often break when text is converted
between encodings. (The German Amazon Marketplace offers plenty of examples.)
A broken umlaut is still valid in the target encoding and can therefore only be
detected through heuristics (magic).

Version 0.5: supporting utf-8 and latin1
For a full changeset, take a look at bitbucket.org/niels_mfo/encoding_repair
(bug reports will also be accepted there)

A common case that breaks a special character is the following:
- An input string is encoded in utf-8 (which uses multi-byte chars)
- It is interpreted as being a valid latin1 string
- Latin1 has a valid representation for nearly all bytes
- Latin1 uses single-byte chars
- Now both bytes of the multi-byte char are interpreted as separate chars
- The special char has broken into two different (valid!) characters
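
The breakage described above can be reproduced in a few lines. This is only an
illustrative sketch in Python 3 (where str is unicode text); the variable names
are made up for the example and are not part of this module.

```python
# A utf-8 encoded umlaut occupies two bytes.
original = "ü"
utf8_bytes = original.encode("utf-8")

# latin1 maps every single byte to its own character, so decoding the
# same bytes as latin1 succeeds but yields two characters instead of one.
misread = utf8_bytes.decode("latin-1")

print(utf8_bytes)     # b'\xc3\xbc'
print(misread)        # Ã¼
print(len(misread))   # 2
```

No error is raised at any point, which is why the damage usually goes
unnoticed until a human reads the text.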

This scenario has many pitfalls:
- The characters are irreversibly broken.
- ... regardless of what you do with the string.
- You can convert it through all encodings and the umlauts won't come back.
- Only through a few heuristic replacements is this module able to help you.

This module assumes that a few special characters are always correct. They are
stored in the list 'umlauts'. Furthermore, the module assumes that the
representation of these characters that would be correct in the other encoding
is always broken when it appears in the target encoding.
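
A minimal sketch of such a replacement heuristic follows, covering only the
direction "utf-8 read as latin1". It is not this module's actual code: the
contents of the 'umlauts' list and the function names build_repair_table and
repair are assumptions made for illustration.

```python
# Assumed contents; the real list shipped with the module may differ.
umlauts = ["ä", "ö", "ü", "Ä", "Ö", "Ü", "ß"]


def build_repair_table(chars):
    """Map the broken form (utf-8 bytes read as latin1) to the intended char."""
    return {c.encode("utf-8").decode("latin-1"): c for c in chars}


def repair(text, table=None):
    """Replace every known broken sequence with the character it stands for."""
    if table is None:
        table = build_repair_table(umlauts)
    for broken_form, intended in table.items():
        text = text.replace(broken_form, intended)
    return text


# Build a broken sample the same way it breaks in practice.
broken = "Größe".encode("utf-8").decode("latin-1")
print(repair(broken))   # Größe
```

The sketch inherits the assumption stated above: a string that legitimately
contains one of the broken sequences would be rewritten as well.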

NOTE:
This only happens because people don't use unicode. If everybody consistently
used unicode strings, I would not have had to write this module.
The best and actually only way to handle encodings correctly is the following:
1. An input string comes into your program.
2. If it is unicode, jump to point 6.
3. If it isn't, you might already need to repair umlauts.
4. Make sure that you know the right encoding of the input string, because it
   is hardly possible to guess.
5. Convert it to unicode.
6. Use the unicode string throughout your whole program.
7. If you can return unicode, return unicode.
8. If you are in doubt, return unicode.
9. If you really need to return anything else, return utf-8.
10. If you are certain that the program which will take your output can
    handle neither unicode nor utf-8, you had better write a bug report.
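
The steps above can be sketched as follows. This is a rough illustration in
Python 3, where str is already unicode text and bytes is encoded data; the
helper names to_text and to_output are hypothetical and not part of this
module.

```python
def to_text(data, encoding):
    """Decode incoming bytes once, at the boundary; pass unicode through."""
    if isinstance(data, str):
        return data
    # The caller has to know the encoding; it is not guessed here.
    return data.decode(encoding)


def to_output(text):
    """Encode only when bytes are really required, and then use utf-8."""
    return text.encode("utf-8")


raw = b"M\xfcller"                 # input known to be latin1
name = to_text(raw, "latin-1")     # unicode from here on
greeting = "Hallo " + name

print(greeting)                    # Hallo Müller
print(to_output(greeting))         # b'Hallo M\xc3\xbcller'
```

Decoding happens exactly once on the way in and encoding exactly once on the
way out, so no string is ever interpreted in the wrong encoding in between.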

Project details

Release history

- 0.7dev (pre-release): Aug 31, 2011
- 0.6dev (pre-release): Aug 25, 2011
- 0.5dev (pre-release, this version): Aug 25, 2011
- 0.3dev (pre-release): Aug 25, 2011

Download files


Source Distribution

encoding_repair-0.5dev.tar.gz (3.6 kB)

Uploaded Aug 25, 2011 Source

File details

Details for the file encoding_repair-0.5dev.tar.gz.

File metadata

Download URL: encoding_repair-0.5dev.tar.gz
Upload date: Aug 25, 2011
Size: 3.6 kB
Tags: Source
Uploaded using Trusted Publishing? No

File hashes

Hashes for encoding_repair-0.5dev.tar.gz
Algorithm	Hash digest
SHA256	`d9a858275d6e7ea029121e6ffacda73b21e318e02c29615caa614a292435398c`
MD5	`0f35173ec65e37e0bda21653bffdb0ac`
BLAKE2b-256	`70abc1d39d27524a17f8240d00906f5430192fb64cb1944434bfd3fea7aa6d1f`