Tools to convert bytes content into Unicode.
Unicodec Package Documentation
This package provides functions for:
- decoding the bytes content of an HTML document into Unicode text
- detecting the encoding of the bytes content of an HTML document
- normalizing an encoding name to its canonical form, according to the WHATWG HTML standard
Feel free to give feedback in Telegram groups: @grablab and @grablab_ru.
Installation
pip install -U unicodec
Usage Example #1
Download a web document with urllib and convert its content to Unicode.
from urllib.request import urlopen
from unicodec import decode_content, detect_content_encoding
res = urlopen("http://lib.ru")
rawdata = res.read()
data = decode_content(rawdata, content_type_header=res.headers["content-type"])
print(data[:70])
print(detect_content_encoding(rawdata, res.headers["content-type"]))
Output:
<html><head><title>Lib.Ru: Библиотека Максима Мошкова</title></head><b
koi8-r
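The example above passes the Content-Type header alongside the raw bytes. As a rough illustration of why that header is useful, here is a minimal standard-library sketch that pulls the charset parameter out of a header value and uses it to decode bytes. The helper names `charset_from_content_type` and `decode_with_header` are hypothetical, and this is not how unicodec is implemented; it only shows the general idea under the assumption that the header names a codec Python knows.

```python
from email.message import Message


def charset_from_content_type(header_value):
    """Return the lowercased charset parameter of a Content-Type value, or None."""
    if not header_value:
        return None
    msg = Message()
    msg["content-type"] = header_value
    return msg.get_content_charset()


def decode_with_header(rawdata, header_value, fallback="utf-8"):
    """Decode bytes using the header's charset (hypothetical helper, not unicodec)."""
    charset = charset_from_content_type(header_value) or fallback
    # errors="replace" keeps the sketch from raising on stray bytes.
    return rawdata.decode(charset, errors="replace")


rawdata = "Библиотека".encode("koi8-r")
print(charset_from_content_type("text/html; charset=KOI8-R"))  # koi8-r
print(decode_with_header(rawdata, "text/html; charset=KOI8-R"))  # Библиотека
```

A real page may send no charset at all, or a wrong one, which is the case a dedicated detection function is meant to handle.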
Usage Example #2
Download a web document with urllib3 and convert its content to Unicode.
from urllib3 import PoolManager
from unicodec import decode_content, detect_content_encoding
res = PoolManager().urlopen("GET", "http://lib.ru")
rawdata = res.data
data = decode_content(rawdata, content_type_header=res.headers["content-type"])
print(data[:70])
print(detect_content_encoding(rawdata, res.headers["content-type"]))
Output:
<html><head><title>Lib.Ru: Библиотека Максима Мошкова</title></head><b
koi8-r
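When the HTTP header carries no charset, an HTML document can still declare its encoding in a meta tag. The sketch below is a deliberately simplified standard-library illustration of sniffing that declaration from the first bytes of a document. The function name `sniff_meta_charset` and the 1024-byte scan limit are assumptions made for this example; it is not the WHATWG prescan algorithm and makes no claim about what unicodec does internally.

```python
import re

# Matches both <meta charset="..."> and
# <meta http-equiv="Content-Type" content="text/html; charset=...">.
META_CHARSET_RE = re.compile(
    rb"""<meta[^>]+charset\s*=\s*["']?([a-zA-Z0-9_:.-]+)""",
    re.IGNORECASE,
)


def sniff_meta_charset(rawdata, scan_limit=1024):
    """Return the lowercased charset declared in a meta tag, or None."""
    match = META_CHARSET_RE.search(rawdata[:scan_limit])
    if match is None:
        return None
    return match.group(1).decode("ascii").lower()


print(sniff_meta_charset(b'<html><head><meta charset="KOI8-R"></head>'))  # koi8-r
print(sniff_meta_charset(b"<html><head></head></html>"))  # None
```

A regular expression like this is easy to fool (comments, scripts, unusual quoting), so treat it as a reading aid rather than a replacement for `detect_content_encoding`.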
Usage Example #3
Convert names of encodings to canonical form (according to WHATWG HTML standard).
from unicodec.normalization import normalize_encoding_name
for name in ["iso8859-1", "utf8", "cp1251"]:
    print("{} -> {}".format(name, normalize_encoding_name(name)))
Output:
iso8859-1 -> windows-1252
utf8 -> utf-8
cp1251 -> windows-1251
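The mapping shown in the output can be pictured as a lookup from a label to a canonical name. The following is a minimal sketch of that idea with a tiny, hand-picked table; `LABEL_TO_NAME` and `normalize_label` are hypothetical names, the table covers only a handful of labels, and the lowercase spelling of the names follows the output shown above. It is not the package's implementation.

```python
# A small illustrative subset of encoding labels (not the full standard table).
LABEL_TO_NAME = {
    "iso8859-1": "windows-1252",
    "latin1": "windows-1252",
    "ascii": "windows-1252",
    "utf8": "utf-8",
    "utf-8": "utf-8",
    "cp1251": "windows-1251",
    "windows-1251": "windows-1251",
}


def normalize_label(label):
    """Map an encoding label to a canonical name (hypothetical helper)."""
    # Strip surrounding ASCII whitespace and compare case-insensitively.
    key = label.strip("\t\n\x0c\r ").lower()
    try:
        return LABEL_TO_NAME[key]
    except KeyError:
        raise ValueError("unknown encoding label: {!r}".format(label)) from None


for name in [" UTF8 ", "ISO8859-1", "cp1251"]:
    print("{!r} -> {}".format(name, normalize_label(name)))
```

Note that several legacy labels, including `iso8859-1` and `ascii`, resolve to `windows-1252`, which is why the canonical name can differ from what the label suggests.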