Tools for converting byte content into Unicode text.
Unicodec Package Documentation
This package provides functions for:
- decoding the byte content of an HTML document into Unicode text
- detecting the encoding of an HTML document's byte content
- normalizing an encoding name to its canonical form, according to the WHATWG HTML standard
Feel free to give feedback in the Telegram groups @grablab and @grablab_ru.
Installation
pip install -U unicodec
Usage Example #1
Download a web document with urllib and convert its content to Unicode.
from urllib.request import urlopen
from unicodec import decode_content, detect_content_encoding
res = urlopen("http://lib.ru")
rawdata = res.read()
data = decode_content(rawdata, content_type_header=res.headers["content-type"])
print(data[:70])
print(detect_content_encoding(rawdata, res.headers["content-type"]))
Output:
<html><head><title>Lib.Ru: Библиотека Максима Мошкова</title></head><b
koi8-r
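To make the idea of encoding detection more concrete, here is a minimal standard-library sketch of one plausible lookup order: the charset in the Content-Type header first, then a `<meta>` tag near the start of the document, then a default. This is not unicodec's implementation. The function names, the simplified regular expression, the 1024-byte scan limit, and the order of precedence are all assumptions made for illustration.

```python
import re
from email.message import Message

# Simplified pattern (assumption): a real HTML prescan handles many more cases.
META_CHARSET_RE = re.compile(
    rb"<meta[^>]+charset\s*=\s*[\"']?([a-zA-Z0-9_:.-]+)", re.IGNORECASE
)


def charset_from_header(content_type_header):
    """Return the lowercased charset parameter of a Content-Type value, or None."""
    if not content_type_header:
        return None
    msg = Message()
    msg["content-type"] = content_type_header
    # get_content_charset() returns None when no charset parameter is present.
    return msg.get_content_charset()


def charset_from_meta(rawdata, limit=1024):
    """Look for a charset declaration in the first `limit` bytes of the document."""
    match = META_CHARSET_RE.search(rawdata[:limit])
    return match.group(1).decode("ascii").lower() if match else None


def sketch_detect_encoding(rawdata, content_type_header=None, default="utf-8"):
    """Hypothetical helper: header first, then <meta>, then a default."""
    return (
        charset_from_header(content_type_header)
        or charset_from_meta(rawdata)
        or default
    )
```

With this sketch, a header value of `text/html; charset=KOI8-R` yields `koi8-r` without looking at the document body at all.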
Usage Example #2
Download a web document with urllib3 and convert its content to Unicode.
from urllib3 import PoolManager
from unicodec import decode_content, detect_content_encoding
res = PoolManager().urlopen("GET", "http://lib.ru")
rawdata = res.data
data = decode_content(rawdata, content_type_header=res.headers["content-type"])
print(data[:70])
print(detect_content_encoding(rawdata, res.headers["content-type"]))
Output:
<html><head><title>Lib.Ru: Библиотека Максима Мошкова</title></head><b
koi8-r
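The examples above depend on choosing the right encoding before decoding. The following standard-library sketch shows why: the same KOI8-R bytes decode cleanly with the right codec, turn into different letters with a wrong single-byte codec, and fail outright as UTF-8. It uses only built-in `str.encode` and `bytes.decode`; the sample word is taken from the page title shown in the output.

```python
title = "Библиотека"
raw = title.encode("koi8-r")  # bytes as a KOI8-R server would send them

# Decoding with the correct encoding restores the original text.
right = raw.decode("koi8-r")

# A wrong single-byte encoding does not raise; it silently yields other letters.
wrong = raw.decode("windows-1251")

# These bytes are not valid UTF-8, so strict decoding raises UnicodeDecodeError.
try:
    raw.decode("utf-8")
    utf8_failed = False
except UnicodeDecodeError:
    utf8_failed = True

# errors="replace" substitutes U+FFFD for undecodable bytes instead of raising.
lossy = raw.decode("utf-8", errors="replace")
```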
Usage Example #3
Convert encoding names to their canonical form (according to the WHATWG HTML standard).
from unicodec.normalization import normalize_encoding_name
for name in ["iso8859-1", "utf8", "cp1251"]:
    print("{} -> {}".format(name, normalize_encoding_name(name)))
Output:
iso8859-1 -> windows-1252
utf8 -> utf-8
cp1251 -> windows-1251
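As a rough illustration of what label normalization involves, the sketch below strips ASCII whitespace, lowercases ASCII letters, and looks the result up in a table of known labels. It is not unicodec's code: `sketch_normalize` and the table are hypothetical, the table is a small hand-copied subset of the WHATWG label list, and the canonical names are written in lowercase to match the output shown above.

```python
import string

# Small subset of WHATWG encoding labels (assumption: the full table is far larger).
WHATWG_LABELS = {
    "utf-8": ["unicode-1-1-utf-8", "utf-8", "utf8"],
    "windows-1251": ["cp1251", "windows-1251", "x-cp1251"],
    "windows-1252": [
        "ascii", "cp1252", "iso-8859-1", "iso8859-1",
        "latin1", "us-ascii", "windows-1252",
    ],
    "koi8-r": ["cskoi8r", "koi", "koi8", "koi8-r", "koi8_r"],
}

# Invert the table so each label maps to its canonical name.
LABEL_TO_NAME = {
    label: name for name, labels in WHATWG_LABELS.items() for label in labels
}

ASCII_WHITESPACE = "\t\n\x0c\r "
# Lowercase only A-Z, so non-ASCII lookalike characters are never folded.
ASCII_LOWER = str.maketrans(string.ascii_uppercase, string.ascii_lowercase)


def sketch_normalize(label):
    """Hypothetical helper: return the canonical name for a label, or None."""
    key = label.strip(ASCII_WHITESPACE).translate(ASCII_LOWER)
    return LABEL_TO_NAME.get(key)
```

Returning `None` for an unknown label is a design choice of this sketch only; how the package itself reports unknown names is not stated here.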