Ural
A helper library full of URL-related heuristics.
Installation
You can install ural with pip using the following command:
pip install ural
Usage
Functions
Generic functions
- ensure_protocol
- get_domain_name
- force_protocol
- is_url
- lru_from_url
- normalize_url
- normalized_lru_from_url
- strip_protocol
- urls_from_html
- urls_from_text
Platform-specific functions
- convert_facebook_url_to_mobile
- extract_user_from_url
Classes
- LRUTrie
Functions
ensure_protocol
Function checking if the url has a protocol, and adding the given one if there is none.
from ural import ensure_protocol
ensure_protocol('www2.lemonde.fr', protocol='https')
>>> 'https://www2.lemonde.fr'
Arguments
- url string: URL to format.
- protocol string: protocol to use if there is none in url. Is 'http' by default.
get_domain_name
Function returning a url's domain name. This function is tld-aware and will return None if no valid domain name can be found.
from ural import get_domain_name
get_domain_name('https://facebook.com/path')
>>> 'facebook.com'
Arguments
- url string: Target url.
force_protocol
Function force-replacing the protocol of the given url.
from ural import force_protocol
force_protocol('https://www2.lemonde.fr', protocol='ftp')
>>> 'ftp://www2.lemonde.fr'
Arguments
- url string: URL to format.
- protocol string: protocol wanted in the output url. Is 'http' by default.
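The difference between ensure_protocol and force_protocol can be illustrated with a small standard-library sketch. The names sketch_ensure_protocol, sketch_force_protocol and PROTOCOL_RE are hypothetical and are not part of ural; the regular expression is only an assumption about what counts as a protocol prefix.

```python
import re

# Assumption: a protocol is a scheme name followed by '://'.
PROTOCOL_RE = re.compile(r'^[a-zA-Z][a-zA-Z0-9+.-]*://')


def sketch_ensure_protocol(url, protocol='http'):
    """Add the protocol only when the url does not already have one."""
    if PROTOCOL_RE.match(url):
        return url
    return protocol + '://' + url


def sketch_force_protocol(url, protocol='http'):
    """Replace any existing protocol with the wanted one."""
    return protocol + '://' + PROTOCOL_RE.sub('', url)


print(sketch_ensure_protocol('ftp://www2.lemonde.fr', protocol='https'))
print(sketch_force_protocol('ftp://www2.lemonde.fr', protocol='https'))
```

The first call leaves the existing ftp protocol untouched, while the second one overwrites it.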
is_url
Function returning True if its argument is a url.
from ural import is_url
is_url('https://www2.lemonde.fr')
>>> True
is_url('lemonde.fr/economie/article.php', require_protocol=False)
>>> True
is_url('lemonde.falsetld/whatever.html', tld_aware=True)
>>> False
Arguments
- string string: string to test.
- require_protocol boolean [True]: whether the argument has to have a protocol to be considered a url.
- tld_aware boolean [False]: whether to check if the url's tld actually exists or not.
lru_from_url
Function returning url parts in hierarchical order.
from ural import lru_from_url
lru_from_url('http://www.lemonde.fr:8000/article/1234/index.html?field=value#2')
>>> ['s:http', 't:8000', 'h:fr', 'h:lemonde', 'h:www', 'p:article', 'p:1234', 'p:index.html', 'q:field=value', 'f:2']
Arguments
- url string: URL to parse.
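To make the hierarchical order concrete, here is a minimal sketch that rebuilds the stems of the example above using only urllib.parse. The function sketch_lru_from_url is a hypothetical stand-in written for illustration; it is not ural's implementation and ignores edge cases such as authentication or malformed urls.

```python
from urllib.parse import urlsplit


def sketch_lru_from_url(url):
    """Illustrative only: split a url into prefixed stems, generic to specific."""
    parts = urlsplit(url)
    stems = []

    if parts.scheme:
        stems.append('s:' + parts.scheme)
    if parts.port is not None:
        stems.append('t:' + str(parts.port))

    # Host labels are reversed so that the tld comes before the subdomains.
    for label in reversed((parts.hostname or '').split('.')):
        if label:
            stems.append('h:' + label)

    for segment in parts.path.split('/'):
        if segment:
            stems.append('p:' + segment)

    if parts.query:
        stems.append('q:' + parts.query)
    if parts.fragment:
        stems.append('f:' + parts.fragment)

    return stems


stems = sketch_lru_from_url('http://www.lemonde.fr:8000/article/1234/index.html?field=value#2')
print(stems)
```

Because the most generic parts come first, two urls from the same site share a common prefix of stems, which is what makes the format suitable for prefix trees.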
normalize_url
Function normalizing the given url by stripping it of parts that usually do not discriminate between pages, such as irrelevant query items or sub-domains.
This is useful when attempting to match similar urls that were written slightly differently, for instance when shared on social media.
from ural import normalize_url
normalize_url('https://www2.lemonde.fr/index.php?utm_source=google')
>>> 'lemonde.fr'
Arguments
- url string: URL to normalize.
- sort_query boolean [True]: whether to sort query items.
- strip_authentication boolean [True]: whether to strip authentication.
- strip_fragment boolean|str ['except-routing']: whether to strip the url's fragment. If set to except-routing, will only strip the fragment if it is not deemed to be js routing (a fragment is deemed to be js routing if it contains a /).
- strip_index boolean [True]: whether to strip trailing index.
- strip_lang_subdomains boolean [False]: whether to strip language subdomains (ex: 'fr-FR.lemonde.fr' to only 'lemonde.fr' because 'fr-FR' isn't a relevant subdomain, it indicates the language and the country).
- strip_trailing_slash boolean [False]: whether to strip trailing slash.
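The kind of heuristics described above can be sketched with the standard library. This is not ural's code: sketch_normalize_url, WWW_RE and INDEX_RE are hypothetical names, and the rules (which sub-domains, index files and query items count as irrelevant) are simplified assumptions chosen to reproduce the example.

```python
import re
from urllib.parse import parse_qsl, urlencode, urlsplit

# Assumptions made for this sketch only.
WWW_RE = re.compile(r'^www\d*\.')                  # 'www.', 'www2.', ...
INDEX_RE = re.compile(r'/index\.(?:html?|php)$')   # trailing index files


def sketch_normalize_url(url):
    parts = urlsplit(url)

    # Drop the protocol and a leading www-like sub-domain.
    host = WWW_RE.sub('', parts.hostname or '')
    if parts.port is not None:
        host += ':%d' % parts.port

    # Drop a trailing index file and a bare root slash.
    path = INDEX_RE.sub('', parts.path)
    if path == '/':
        path = ''

    # Drop tracking items (assumed here to be 'utm_*') and sort the rest.
    items = [
        (key, value)
        for key, value in parse_qsl(parts.query, keep_blank_values=True)
        if not key.startswith('utm_')
    ]
    query = urlencode(sorted(items))

    # Keep the fragment only when it looks like js routing (contains a '/').
    fragment = parts.fragment if '/' in parts.fragment else ''

    result = host + path
    if query:
        result += '?' + query
    if fragment:
        result += '#' + fragment
    return result


print(sketch_normalize_url('https://www2.lemonde.fr/index.php?utm_source=google'))
```

Sorting the query items means that two urls differing only in parameter order normalize to the same string.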
normalized_lru_from_url
Function normalizing url and returning its parts in hierarchical order.
from ural import normalized_lru_from_url
normalized_lru_from_url('http://www.lemonde.fr:8000/article/1234/index.html?field=value#2')
>>> ['t:8000', 'h:fr', 'h:lemonde', 'h:www', 'p:article', 'p:1234', 'q:field=value']
Arguments
This function accepts the same arguments as normalize_url.
strip_protocol
Function removing the protocol from the url.
from ural import strip_protocol
strip_protocol('https://www2.lemonde.fr/index.php')
>>> 'www2.lemonde.fr/index.php'
Arguments
- url string: URL to format.
urls_from_html
Function returning an iterator over the urls present in the links of given HTML text.
from ural import urls_from_html
html = """<p>Hey! Check this site: <a href="https://medialab.sciencespo.fr/">médialab</a></p>"""
for url in urls_from_html(html):
    print(url)
>>> 'https://medialab.sciencespo.fr/'
Arguments
- string string: html string.
urls_from_text
Function returning an iterator over the urls present in the string argument. Extracts only the urls with a protocol.
from ural import urls_from_text
text = "Hey! Check this site: https://medialab.sciencespo.fr/, it looks really cool. They're developing many tools on https://github.com/"
for url in urls_from_text(text):
    print(url)
>>> 'https://medialab.sciencespo.fr/'
>>> 'https://github.com/'
Arguments
- string string: source string.
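As a rough idea of what extracting only urls with a protocol can look like, here is a regex-based sketch. The pattern URL_RE, the TRAILING_PUNCTUATION set and the function sketch_urls_from_text are assumptions made for illustration; ural's actual extraction logic is likely more thorough.

```python
import re

# Assumption: a url starts with http:// or https:// and runs until
# whitespace, an angle bracket or a quote.
URL_RE = re.compile(r'https?://[^\s<>"\']+')

# Punctuation that usually belongs to the sentence, not to the url.
TRAILING_PUNCTUATION = '.,;:!?)'


def sketch_urls_from_text(text):
    """Lazily yield the urls found in text, like an iterator would."""
    for match in URL_RE.finditer(text):
        yield match.group(0).rstrip(TRAILING_PUNCTUATION)


text = "Hey! Check this site: https://medialab.sciencespo.fr/, it looks really cool. They're developing many tools on https://github.com/"
for url in sketch_urls_from_text(text):
    print(url)
```

Stripping trailing punctuation is what keeps the comma after the first url out of the result.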
convert_facebook_url_to_mobile
Function returning the mobile version of the given Facebook url. Will raise an exception if a non-Facebook url is given.
from ural.facebook import convert_facebook_url_to_mobile
convert_facebook_url_to_mobile('http://www.facebook.com/post/974583586343')
>>> 'http://m.facebook.com/post/974583586343'
extract_user_from_url
Function extracting user information from a Facebook user url.
from ural.facebook import extract_user_from_url
extract_user_from_url('https://www.facebook.com/people/Sophia-Aman/102016783928989')
>>> FacebookUser(id='102016783928989', handle=None, url='https://www.facebook.com/profile.php?id=102016783928989')
extract_user_from_url('/annelaure.rivolu?rc=p&__tn__=R')
>>> FacebookUser(id=None, handle='annelaure.rivolu', url='https://www.facebook.com/annelaure.rivolu')
Classes
LRUTrie
Class implementing a prefix tree (Trie) storing LRUs and their metadata. It makes it possible to find the longest stored prefix matching a given url.
set
Method storing a url in the LRUTrie along with its metadata.
from ural import LRUTrie
trie = LRUTrie()
trie.set('http://www.lemonde.fr', {'type': 'general press'})
trie.match('http://www.lemonde.fr')
>>> {'type': 'general press'}
Arguments
- url string: url to store in the LRUTrie.
- metadata dict: metadata of the url.
match
Method returning the metadata of the given url as it is stored in the LRUTrie.
If the exact given url doesn't exist in the LRUTrie, it returns the metadata of the longest common prefix, or None if there is no common prefix.
from ural import LRUTrie
trie = LRUTrie()
trie.set('http://www.lemonde.fr', {'media': 'lemonde'})
trie.match('http://www.lemonde.fr')
>>> {'media': 'lemonde'}
trie.match('http://www.lemonde.fr/politique')
>>> {'media': 'lemonde'}
Arguments
- url string: url to match in the LRUTrie.
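The longest-prefix behaviour can be sketched with nested dictionaries keyed by LRU stems. SketchTrie and stems_of are hypothetical names used for illustration, not ural's classes, and the stem function is deliberately reduced to scheme, host and path.

```python
from urllib.parse import urlsplit

_VALUE = object()  # sentinel key marking a node that holds metadata


def stems_of(url):
    """Simplified stems: scheme, reversed host labels, then path segments."""
    parts = urlsplit(url)
    stems = ['s:' + parts.scheme]
    stems += ['h:' + label for label in reversed((parts.hostname or '').split('.')) if label]
    stems += ['p:' + segment for segment in parts.path.split('/') if segment]
    return stems


class SketchTrie:
    def __init__(self):
        self.root = {}

    def set(self, url, metadata):
        node = self.root
        for stem in stems_of(url):
            node = node.setdefault(stem, {})
        node[_VALUE] = metadata

    def match(self, url):
        node = self.root
        best = None
        for stem in stems_of(url):
            node = node.get(stem)
            if node is None:
                break
            # Remember the deepest stored url met so far along the path.
            if _VALUE in node:
                best = node[_VALUE]
        return best


trie = SketchTrie()
trie.set('http://www.lemonde.fr', {'media': 'lemonde'})
print(trie.match('http://www.lemonde.fr/politique'))
print(trie.match('http://www.lefigaro.fr'))
```

The first lookup walks past the stored url and falls back to its metadata, while the second diverges before reaching any stored url and returns None.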
values
Method yielding the metadata of each url stored in the LRUTrie.
from ural import LRUTrie
trie = LRUTrie()
trie.set('http://www.lemonde.fr', {'media' : 'lemonde'})
trie.set('http://www.lefigaro.fr', {'media' : 'lefigaro'})
trie.set('https://www.liberation.fr', {'media' : 'liberation'})
for value in trie.values():
    print(value)
>>> {'media': 'lemonde'}
>>> {'media': 'liberation'}
>>> {'media': 'lefigaro'}