Ural

A helper library full of URL-related heuristics.

Installation

You can install ural with pip with the following command:

pip install ural

Usage

Functions

Classes


Functions

ensure_protocol

Function checking whether the url has a protocol and adding the given one if it has none.

from ural import ensure_protocol

ensure_protocol('www2.lemonde.fr', protocol='https')
>>> 'https://www2.lemonde.fr'

Arguments

  • url string: URL to format.
  • protocol string: protocol to use if there is none in url. Defaults to 'http'.
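
To make the "add only if missing" behaviour concrete, here is a minimal standard-library sketch. It is not ural's implementation: the name ensure_protocol_sketch and the regular expression are assumptions made for illustration, and a url is considered to have a protocol only if it starts with a scheme followed by "://".

```python
import re

# Assumption for this sketch: a protocol is a scheme followed by "://".
PROTOCOL_RE = re.compile(r'^[a-zA-Z][a-zA-Z0-9+.-]*://')


def ensure_protocol_sketch(url, protocol='http'):
    """Prepend `protocol` only when `url` does not already carry one."""
    url = url.strip()
    if PROTOCOL_RE.match(url):
        # A protocol is already present: leave the url untouched.
        return url
    return protocol + '://' + url


print(ensure_protocol_sketch('www2.lemonde.fr', protocol='https'))
print(ensure_protocol_sketch('ftp://www2.lemonde.fr', protocol='https'))
```

The second call keeps its original ftp protocol, which is the difference from force-replacing it.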

get_domain_name

Function returning a url's domain name. It is TLD-aware and returns None if no valid domain name can be found.

from ural import get_domain_name

get_domain_name('https://facebook.com/path')
>>> 'facebook.com'

Arguments

  • url string: Target url.

force_protocol

Function force-replacing the protocol of the given url.

from ural import force_protocol

force_protocol('https://www2.lemonde.fr', protocol='ftp')
>>> 'ftp://www2.lemonde.fr'

Arguments

  • url string: URL to format.
  • protocol string: protocol wanted in the output url. Defaults to 'http'.
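
Force-replacing can be thought of as stripping any existing protocol and then prepending the requested one. The sketch below uses only the standard library; force_protocol_sketch and its regular expression are hypothetical names chosen for this example, not ural's code.

```python
import re

# Assumption for this sketch: a protocol is a scheme followed by "://".
PROTOCOL_RE = re.compile(r'^[a-zA-Z][a-zA-Z0-9+.-]*://')


def force_protocol_sketch(url, protocol='http'):
    """Replace the protocol of `url`, adding one if it had none."""
    # Remove the leading protocol, if any, then prepend the wanted one.
    bare = PROTOCOL_RE.sub('', url.strip(), count=1)
    return protocol + '://' + bare


print(force_protocol_sketch('https://www2.lemonde.fr', protocol='ftp'))
```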

is_url

Function returning True if its argument is a url.

from ural import is_url

is_url('https://www2.lemonde.fr')
>>> True

Arguments

  • string string: string to test.
  • require_protocol boolean: whether the argument must have a protocol to be considered a url. Defaults to True.

lru_from_url

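Function returning the parts of a url in hierarchical order, from the most general (scheme, port, top-level domain) to the most specific (path, query, fragment).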

from ural import lru_from_url

lru_from_url('http://www.lemonde.fr:8000/article/1234/index.html?field=value#2')
>>> ['s:http', 't:8000', 'h:fr', 'h:lemonde', 'h:www', 'p:article', 'p:1234', 'p:index.html', 'q:field=value', 'f:2']

Arguments

  • url string: URL to parse.
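
The ordering shown in the example above can be reproduced with urllib.parse. This is only a sketch of the idea, assuming the stem prefixes seen in that output (s, t, h, p, q, f); lru_sketch is a hypothetical name and ural's own function may handle more edge cases.

```python
from urllib.parse import urlsplit


def lru_sketch(url):
    """Return the url's parts as prefixed stems, most general first."""
    parts = urlsplit(url)
    stems = []
    if parts.scheme:
        stems.append('s:' + parts.scheme)
    if parts.port is not None:
        stems.append('t:' + str(parts.port))
    if parts.hostname:
        # Host labels are reversed so the top-level domain comes first.
        for label in reversed(parts.hostname.split('.')):
            stems.append('h:' + label)
    for segment in parts.path.split('/'):
        if segment:
            stems.append('p:' + segment)
    if parts.query:
        stems.append('q:' + parts.query)
    if parts.fragment:
        stems.append('f:' + parts.fragment)
    return stems


print(lru_sketch('http://www.lemonde.fr:8000/article/1234/index.html?field=value#2'))
```

Because the host is reversed, two pages of the same site share a long common prefix of stems, which is what makes the representation suitable for prefix trees.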

normalize_url

Function normalizing the given url by stripping the parts that usually do not discriminate between pages, such as irrelevant query items or sub-domains.

This is useful when matching urls that point to the same resource but are written slightly differently, for instance when shared on social media.

from ural import normalize_url

normalize_url('https://www2.lemonde.fr/index.php?utm_source=google')
>>> 'lemonde.fr'

Arguments

  • url string: URL to normalize.
  • sort_query boolean [True]: whether to sort query items.
  • strip_authentication boolean [True]: whether to strip authentication.
  • strip_index boolean [True]: whether to strip trailing index.
  • strip_trailing_slash boolean [False]: whether to strip trailing slash.
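
To illustrate why options such as sort_query help with matching, here is a deliberately simplified standard-library sketch. Its rules are assumptions made for the example (drop the protocol, drop query items whose name starts with utm_, optionally sort the query and strip a trailing slash); it does not strip sub-domains or index files and is not ural's actual rule set.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit


def normalize_sketch(url, sort_query=True, strip_trailing_slash=False):
    """Simplified normalization: host + path + cleaned query."""
    parts = urlsplit(url)
    host = parts.hostname or ''
    # Assumption: query items prefixed with "utm_" are tracking noise.
    items = [
        (key, value)
        for key, value in parse_qsl(parts.query, keep_blank_values=True)
        if not key.startswith('utm_')
    ]
    if sort_query:
        items.sort()
    path = parts.path
    if strip_trailing_slash and path.endswith('/'):
        path = path.rstrip('/')
    query = urlencode(items)
    return host + path + ('?' + query if query else '')


a = normalize_sketch('https://lemonde.fr/a?b=2&a=1&utm_source=google')
b = normalize_sketch('http://lemonde.fr/a?a=1&b=2')
print(a == b)
```

Both urls reduce to the same string, so they can be compared or used as dictionary keys directly.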

normalized_lru_from_url

Function normalizing url and returning its parts in hierarchical order.

from ural import normalized_lru_from_url

normalized_lru_from_url('http://www.lemonde.fr:8000/article/1234/index.html?field=value#2')
>>> ['t:8000', 'h:fr', 'h:lemonde', 'h:www', 'p:article', 'p:1234', 'q:field=value']

Arguments

This function accepts the same arguments as normalize_url.


strip_protocol

Function removing the protocol from the url.

from ural import strip_protocol

strip_protocol('https://www2.lemonde.fr/index.php')
>>> 'www2.lemonde.fr/index.php'

Arguments

  • url string: URL to format.

urls_from_html

Function returning an iterator over the urls present in the links of given HTML text.

from ural import urls_from_html

html = """<p>Hey! Check this site: <a href="https://medialab.sciencespo.fr/">médialab</a></p>"""

for url in urls_from_html(html):
    print(url)
>>> 'https://medialab.sciencespo.fr/'

Arguments

  • string string: html string.
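
For readers curious how link extraction can work, the following sketch collects the href attribute of every a tag with the standard library's html.parser. LinkCollector and urls_from_html_sketch are hypothetical names; ural's own extraction strategy may differ.

```python
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Collect the href attribute of every <a> tag encountered."""

    def __init__(self):
        super().__init__()
        self.urls = []

    def handle_starttag(self, tag, attrs):
        if tag != 'a':
            return
        for name, value in attrs:
            if name == 'href' and value:
                self.urls.append(value)


def urls_from_html_sketch(html):
    """Yield the urls found in the links of the given HTML text."""
    collector = LinkCollector()
    collector.feed(html)
    collector.close()
    yield from collector.urls


html = '<p>Hey! Check this site: <a href="https://medialab.sciencespo.fr/">médialab</a></p>'
for url in urls_from_html_sketch(html):
    print(url)
```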

urls_from_text

Function returning an iterator over the urls present in the string argument. Extracts only the urls with a protocol.

from ural import urls_from_text

text = "Hey! Check this site: https://medialab.sciencespo.fr/, it looks really cool. They're developing many tools on https://github.com/"

for url in urls_from_text(text):
    print(url)
>>> 'https://medialab.sciencespo.fr/'
>>> 'https://github.com/'

Arguments

  • string string: source string.
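
One subtlety of extracting urls from free text is trailing punctuation, as with the comma after the first url in the example above. The sketch below shows one simple way to handle it with a regular expression; the pattern and the name urls_from_text_sketch are assumptions for illustration and far less thorough than a real extractor.

```python
import re

# Assumption: only http(s) urls are of interest, and a url ends at
# whitespace, angle brackets or quotes.
URL_RE = re.compile(r'https?://[^\s<>"\']+')
TRAILING_PUNCTUATION = '.,;:!?)'


def urls_from_text_sketch(text):
    """Yield urls with a protocol found in `text`."""
    for match in URL_RE.finditer(text):
        # Punctuation glued to the end of a url belongs to the sentence.
        yield match.group(0).rstrip(TRAILING_PUNCTUATION)


text = "Hey! Check this site: https://medialab.sciencespo.fr/, it looks really cool. They're developing many tools on https://github.com/"
for url in urls_from_text_sketch(text):
    print(url)
```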

Classes

LRUTrie

Class implementing a prefix tree (trie) that stores LRUs along with their metadata and can find the longest stored prefix matching a given url.

set

Method storing a url in the LRUTrie along with its metadata.

from ural import LRUTrie

trie = LRUTrie()
trie.set('http://www.lemonde.fr', {'type': 'general press'})

trie.match('http://www.lemonde.fr')
>>> {'type': 'general press'}

Arguments

  • url string: url to store in the LRUTrie.
  • metadata dict: metadata of the url.

match

Method returning the metadata of the given url as it is stored in the LRUTrie. If the exact given url doesn't exist in the LRUTrie, it returns the metadata of the longest common prefix, or None if there is no common prefix.

from ural import LRUTrie

trie = LRUTrie()
trie.set('http://www.lemonde.fr', {'media': 'lemonde'})

trie.match('http://www.lemonde.fr')
>>> {'media': 'lemonde'}
trie.match('http://www.lemonde.fr/politique')
>>> {'media': 'lemonde'}

Arguments

  • url string: url to match in the LRUTrie.
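
The longest-prefix behaviour described above can be sketched with a small trie keyed on LRU stems. This is an illustration only: LRUTrieSketch, lru_stems and the choice to ignore port, query and fragment are assumptions made for this example, not the library's implementation.

```python
from urllib.parse import urlsplit


def lru_stems(url):
    """Scheme, reversed host labels, then path segments (simplified)."""
    parts = urlsplit(url)
    stems = ['s:' + parts.scheme]
    stems += ['h:' + label for label in reversed((parts.hostname or '').split('.'))]
    stems += ['p:' + segment for segment in parts.path.split('/') if segment]
    return stems


class _Node:
    def __init__(self):
        self.children = {}
        self.metadata = None
        self.terminal = False  # True when a url was stored exactly here


class LRUTrieSketch:
    def __init__(self):
        self.root = _Node()

    def set(self, url, metadata):
        node = self.root
        for stem in lru_stems(url):
            node = node.children.setdefault(stem, _Node())
        node.metadata = metadata
        node.terminal = True

    def match(self, url):
        """Return the metadata of the longest stored prefix, or None."""
        node = self.root
        best = None
        for stem in lru_stems(url):
            node = node.children.get(stem)
            if node is None:
                break
            if node.terminal:
                best = node.metadata
        return best


trie = LRUTrieSketch()
trie.set('http://www.lemonde.fr', {'media': 'lemonde'})
print(trie.match('http://www.lemonde.fr/politique'))
print(trie.match('http://www.lefigaro.fr'))
```

Walking the stems from the root and remembering the last terminal node seen is what turns an exact-match lookup into a longest-prefix lookup.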

values

Method yielding the metadata of each url stored in the LRUTrie.

from ural import LRUTrie

trie = LRUTrie()
trie.set('http://www.lemonde.fr', {'media': 'lemonde'})
trie.set('http://www.lefigaro.fr', {'media': 'lefigaro'})
trie.set('https://www.liberation.fr', {'media': 'liberation'})

for value in trie.values():
    print(value)
>>> {'media': 'lemonde'}
>>> {'media': 'liberation'}
>>> {'media': 'lefigaro'}
