Helpers to repair encodings (especially umlauts)
Project description
It is alarming, that very often, special characters like umlauts break when
converting through different encodings. (You might want to take a look at the
German Amazon Marketplace.)
A broken umlaut is still valid in the target encoding and therefore can only be
detect through heuristics (magic).
Version 0.5: supporting utf-8 and latin1
For a full changeset, take a look at bitbucket.org/niels_mfo/encoding_repair
(bug reports will also be accepted there)
A common case that breaks a special character is the following:
- An input string is coded in utf-8 (which uses multibyte chars)
- It is interpreted as being a valid latin1 string
- Latin1 has a valid representation for nearly all bytes
- Latin1 uses single-byte chars
- Now both bytes of the multi-byte char are interpreted as chars
- The special char broke off into two different (valid!) characters
This scenario has many pitfalls:
- The characters are irreversebly broken.
- ... regardless of what you do with the string.
- You can convert it through all encodings and the umlauts won't come back.
- Only through a few heuristical replaces, this module is able to help you.
This module assumes, that a few special characters are always correct. They are
stored in the list 'umlauts'. Furthermore, the module assumes, that their
representation, that would be correct in the other encoding, is always broken in
the target-encoding.
NOTE:
This only happens, because people don't use unicode. If everybody would
consequently use unicode strings, I would not have to write this module.
The best and actually only way to handle encodings correctly is the following:
- An input string comes into your programm.
- If it is unicode, jump to point 6.
- If it isn't, you might already need to repair umlauts.
- You need to make sure, that you know the right encoding of the input
string, because it is hardly possible to guess.
- Convert it to unicode.
- Use the unicode string throuout your whole programm.
- If you can return unicode, return unicode.
- If you are in doubt, return unicode.
- If you really need to return anything else, return utf-8.
- If you are certain, that the programm, which will take your output is not
able to handle neither unicode nor utf-8, you better write a bug report.
converting through different encodings. (You might want to take a look at the
German Amazon Marketplace.)
A broken umlaut is still valid in the target encoding and therefore can only be
detect through heuristics (magic).
Version 0.5: supporting utf-8 and latin1
For a full changeset, take a look at bitbucket.org/niels_mfo/encoding_repair
(bug reports will also be accepted there)
A common case that breaks a special character is the following:
- An input string is coded in utf-8 (which uses multibyte chars)
- It is interpreted as being a valid latin1 string
- Latin1 has a valid representation for nearly all bytes
- Latin1 uses single-byte chars
- Now both bytes of the multi-byte char are interpreted as chars
- The special char broke off into two different (valid!) characters
This scenario has many pitfalls:
- The characters are irreversebly broken.
- ... regardless of what you do with the string.
- You can convert it through all encodings and the umlauts won't come back.
- Only through a few heuristical replaces, this module is able to help you.
This module assumes, that a few special characters are always correct. They are
stored in the list 'umlauts'. Furthermore, the module assumes, that their
representation, that would be correct in the other encoding, is always broken in
the target-encoding.
NOTE:
This only happens, because people don't use unicode. If everybody would
consequently use unicode strings, I would not have to write this module.
The best and actually only way to handle encodings correctly is the following:
- An input string comes into your programm.
- If it is unicode, jump to point 6.
- If it isn't, you might already need to repair umlauts.
- You need to make sure, that you know the right encoding of the input
string, because it is hardly possible to guess.
- Convert it to unicode.
- Use the unicode string throuout your whole programm.
- If you can return unicode, return unicode.
- If you are in doubt, return unicode.
- If you really need to return anything else, return utf-8.
- If you are certain, that the programm, which will take your output is not
able to handle neither unicode nor utf-8, you better write a bug report.
Project details
Release history Release notifications | RSS feed
Download files
Download the file for your platform. If you're not sure which to choose, learn more about installing packages.
Source Distribution
File details
Details for the file encoding_repair-0.5dev.tar.gz
.
File metadata
- Download URL: encoding_repair-0.5dev.tar.gz
- Upload date:
- Size: 3.6 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
File hashes
Algorithm | Hash digest | |
---|---|---|
SHA256 | d9a858275d6e7ea029121e6ffacda73b21e318e02c29615caa614a292435398c |
|
MD5 | 0f35173ec65e37e0bda21653bffdb0ac |
|
BLAKE2b-256 | 70abc1d39d27524a17f8240d00906f5430192fb64cb1944434bfd3fea7aa6d1f |