Skip to main content

A helper library full of URL-related heuristics.

Project description

Build Status

Ural

A helper library full of URL-related heuristics.

Installation

You can install ural with pip with the following command:

pip install ural

Usage

Generic functions

Utilities

Classes

LRU-related functions (What on earth is a LRU?)

LRU-related classes

Platform-specific functions


could_be_html

Function returning whether the url could return HTML.

from ural import could_be_html

could_be_html('https://www.lemonde.fr')
>>> True

could_be_html('https://www.lemonde.fr/articles/page.php')
>>> True

could_be_html('https://www.lemonde.fr/data.json')
>>> False

could_be_html('https://www.lemonde.fr/img/figure.jpg')
>>> False

ensure_protocol

Function checking if the url has a protocol, and adding the given one if there is none.

from ural import ensure_protocol

ensure_protocol('www.lemonde.fr', protocol='https')
>>> 'https://www.lemonde.fr'

Arguments

  • url string: URL to format.
  • protocol string: protocol to use if there is none in url. Is 'http' by default.

force_protocol

Function force-replacing the protocol of the given url.

from ural import force_protocol

force_protocol('https://www2.lemonde.fr', protocol='ftp')
>>> 'ftp://www2.lemonde.fr'

Arguments

  • url string: URL to format.
  • protocol string: protocol wanted in the output url. Is 'http' by default.

format_url

Function formatting a url given some typical parameters.

from ural import format_url

format_url(
  'https://lemonde.fr',
  path='/article.html',
  args={'id': '48675'},
  fragment='title-2'
)
>>> 'https://lemonde.fr/article.html?id=48675#title-2'

# Path can be given as an iterable
format_url('https://lemonde.fr', path=['articles', 'one.html'])
>>> 'https://lemonde.fr/articles/one.html'

# Extension
format_url('https://lemonde.fr', path=['article'], ext='html')
>>> 'https://lemonde.fr/articles/article.html'

# Query args are formatted/quoted and/or skipped if None/False
format_url(
  "http://lemonde.fr",
  path=["business", "articles"],
  args={
    "hello": "world",
    "number": 14,
    "boolean": True,
    "skipped": None,
    "also-skipped": False,
    "quoted": "test=ok",
  },
  fragment="#test",
)
>>> 'http://lemonde.fr/business/articles?boolean&hello=world&number=14&quoted=test%3Dok#test'

# Custom argument value formatting
def format_arg_value(key, value):
  if key == 'ids':
    return ','.join(value)

  return key

format_url(
  'https://lemonde.fr',
  args={'ids': [1, 2]},
  format_arg_value=format_arg_value
)
>>> 'https://lemonde.fr?ids=1%2C2'

# Formatter class
from ural import URLFormatter

formatter = URLFormatter('https://lemonde.fr', args={'id': 'one'})

formatter(path='/article.html')
>>> 'https://lemonde.fr/article.html?id=one'

# same as:
formatter.format(path='/article.html')
>>> 'https://lemonde.fr/article.html?id=one'

# Query arguments are merged
formatter(path='/article.html', args={"user_id": "two"})
>>> 'https://lemonde.fr/article.html?id=one&user_id=two'

# Easy subclassing
class MyCustomFormatter(URLFormatter):
  BASE_URL = 'https://lemonde.fr/api'

  def format_api_call(self, token):
    return self.format(args={'token': token})

formatter = MyCustomFormatter()

formatter.format_api_call('2764753')
>>> 'https://lemonde.fr/api?token=2764753'

Arguments

  • base_url str: Base url.
  • path ?str|list: the url's path.
  • args ?dict: query arguments as a dictionary.
  • format_arg_value ?callable: function taking a query argument key and value and returning the formatted value.
  • fragment ?str: the url's fragment.
  • ext ?str: path extension such as .html.

get_domain_name

Function returning an url's domain name. This function is of course tld-aware and will return None if no valid domain name can be found.

from ural import get_domain_name

get_domain_name('https://facebook.com/path')
>>> 'facebook.com'

get_hostname

Function returning the given url's full hostname. It can work on scheme-less urls.

from ural import get_hostname

get_hostname('http://www.facebook.com/path')
>>> 'www.facebook.com'

get_normalized_hostname

Function returning the given url's normalized hostname, i.e. without usually irrelevant subdomains etc. Works a lot like normalize_url.

from ural import get_normalized_hostname

get_normalized_hostname('http://www.facebook.com/path')
>>> 'facebook.com'

get_normalized_hostname('http://fr-FR.facebook.com/path', strip_lang_subdomains=True)
>>> 'facebook.com'

Arguments

  • url str: Target url.
  • infer_redirection bool [True]: whether to attempt resolving common redirects by leveraging well-known GET parameters.
  • normalize_amp ?bool [True]: Whether to attempt to normalize Google AMP subdomains.
  • strip_lang_subdomains ?bool [False]: Whether to drop language-specific subdomains.

has_special_host

Function returning whether the given url looks like it has a special host.

from ural import has_special_host

has_special_host('http://104.19.154.83')
>>> True

has_special_host('http://youtube.com')
>>> False

has_valid_suffix

Function returning whether the given url has a valid suffix as per Mozzila's Public Suffix List.

from ural import has_valid_suffix

has_valid_suffix('http://lemonde.fr')
>>> True

has_valid_suffix('http://lemonde.doesnotexist')
>>> False

# Also works with hostnames
has_valid_suffix('lemonde.fr')
>>> True

has_valid_tld

Function returning whether the given url has a valid Top Level Domain (TLD) as per IANA's list.

from ural import has_valid_tld

has_valid_tld('http://lemonde.fr')
>>> True

has_valid_tld('http://lemonde.doesnotexist')
>>> False

# Also works with hostnames
has_valid_tld('lemonde.fr')
>>> True

infer_redirection

Function attempting to find obvious clues in the given url that it is in fact a redirection and resolving the redirection automatically without firing any HTTP request. If nothing is found, the given url will be returned as-is.

The function is by default recursive and will attempt to infer redirections until none is found, but you can disable this behavior if you need to.

from ural import infer_redirection

infer_redirection('https://www.google.com/url?sa=t&source=web&rct=j&url=https%3A%2F%2Fm.youtube.com%2Fwatch%3Fv%3D4iJBsjHMviQ&ved=2ahUKEwiBm-TO3OvkAhUnA2MBHQRPAR4QwqsBMAB6BAgDEAQ&usg=AOvVaw0i7y2_fEy3nwwdIZyo_qug')
>>> 'https://m.youtube.com/watch?v=4iJBsjHMviQ'

infer_redirection('https://test.com?url=http%3A%2F%2Flemonde.fr%3Fnext%3Dhttp%253A%252F%252Ftarget.fr')
>>> 'http://target.fr'

infer_redirection(
  'https://test.com?url=http%3A%2F%2Flemonde.fr%3Fnext%3Dhttp%253A%252F%252Ftarget.fr',
  recursive=False
)
>>> 'http://lemonde.fr?next=http%3A%2F%2Ftarget.fr'

is_homepage

Function returning whether the given url is probably a website's homepage, based on its path.

from ural import is_homepage

is_homepage('http://lemonde.fr')
>>> True

is_homepage('http://lemonde.fr/index.html')
>>> True

is_homepage('http://lemonde.fr/business/article5.html')
>>> False

is_shortened_url

Function returning whether the given url is probably a shortened url. It works by matching the given url domain against most prominent shortener domains. So the result could be a false negative.

from ural import is_shortened_url

is_shortened_url('http://lemonde.fr')
>>> False

is_shortened_url('http://bit.ly/1sNZMwL')
>>> True

is_special_host

Function returning whether the given hostname looks like a special host.

from ural import is_special_host

is_special_host('104.19.154.83')
>>> True

is_special_host('youtube.com')
>>> False

is_typo_url

Function returning whether the given string is probably a typo error. This function doesn't test if the given string is a valid url. It works by matching the given url tld against most prominent typo-like tlds or by matching the given string against most prominent inclusive language terminations. So the result could be a false negative.

from ural import is_typo_url

is_typo_url('http://dirigeants.es')
>>> True

is_typo_url('https://www.instagram.com')
>>> False

is_url

Function returning whether the given string is a valid url.

from ural import is_url

is_url('https://www2.lemonde.fr')
>>> True

is_url('lemonde.fr/economie/article.php', require_protocol=False)
>>> True

is_url('lemonde.falsetld/whatever.html', tld_aware=True)
>>> False

Arguments

  • string string: string to test.
  • require_protocol bool [True]: whether the argument has to have a protocol to be considered a url.
  • tld_aware bool [False]: whether to check if the url's tld actually exists or not.
  • allow_spaces_in_path bool [False]: whether the allow spaces in URL paths.
  • only_http_https bool [True]: whether to only allow the http and https protocols.

is_valid_tld

Function returning whether the given Top Level Domain (TLD) is valid as per IANA's list.

from ural import is_valid_tld

is_valid_tld('.fr')
>>> True

is_valid_tld('com')
>>> True

is_valid_tld('.doesnotexist')
>>> False

normalize_hostname

Function normalizing the given hostname, i.e. without usually irrelevant subdomains etc. Works a lot like normalize_url.

from ural import normalize_hostname

normalize_hostname('www.facebook.com')
>>> 'facebook.com'

normalize_hostname('fr-FR.facebook.com', strip_lang_subdomains=True)
>>> 'facebook.com'

normalize_url

Function normalizing the given url by stripping it of usually non-discriminant parts such as irrelevant query items or sub-domains etc.

This is a very useful utility when attempting to match similar urls written slightly differently when shared on social media etc.

from ural import normalize_url

normalize_url('https://www2.lemonde.fr/index.php?utm_source=google')
>>> 'lemonde.fr'

Arguments

  • url string: URL to normalize.
  • infer_redirection ?bool [True]: whether to attempt resolving common redirects by leveraging well-known GET parameters.
  • fix_common_mistakes ?bool [True]: whether to attempt to fix common URL mistakes.
  • normalize_amp ?bool [True]: whether to attempt to normalize Google AMP urls.
  • quoted ?bool [True]: whether to normalize to a quoted or unquoted version of the url.
  • sort_query ?bool [True]: whether to sort query items.
  • strip_authentication ?bool [True]: whether to strip authentication.
  • strip_fragment ?bool|str ['except-routing']: whether to strip the url's fragment. If set to except-routing, will only strip the fragment if the fragment is not deemed to be js routing (i.e. if it contains a /).
  • strip_index ?bool [True]: whether to strip trailing index.
  • strip_irrelevant_subdomains ?bool [False]: whether to strip irrelevant subdomains such as www etc.
  • strip_lang_query_items ?bool [False]: whether to strip language query items (ex: gl=pt_BR).
  • strip_lang_subdomains ?bool [False]: whether to strip language subdomains (ex: fr-FR.lemonde.fr to only lemonde.fr because fr-FR isn't a relevant subdomain, it indicates the language and the country).
  • strip_protocol ?bool [True]: whether to strip the url's protocol.
  • strip_trailing_slash ?bool [True]: whether to strip trailing slash.
  • unsplit ?bool [True]: whether to return a stringified version of the normalized url or directly the SplitResult instance worked on by the normalization process.

should_follow_href

Function returning whether the given href should be followed (usually from a crawler's context). This means it will filter out anchors, and url having a scheme which is not http/https.

from ural import should_follow_href

should_follow_href('#top')
>>> False

should_follow_href('http://lemonde.fr')
>>> True

should_follow_href('/article.html')
>>> True

should_resolve

Function returning whether the given function looks like something you would want to resolve because the url will probably lead to some redirection.

It is quite similar to is_shortened_url but covers more ground since it also deal with url patterns which are not shortened per se.

from ural import should_resolve

should_resolve('http://lemonde.fr')
>>> False

should_resolve('http://bit.ly/1sNZMwL')
>>> True

should_resolve('https://doi.org/10.4000/vertigo.26405')
>>> True

split_suffix

Function splitting a hostname or a url's hostname into the domain part and the suffix part (while respecting Mozzila's Public Suffix List).

from ural import split_suffix

split_suffix('http://www.bbc.co.uk/article.html')
>>> ('www.bbc', 'co.uk')

split_suffix('http://www.bbc.idontexist')
>>> None

split_suffix('lemonde.fr')
>>> ('lemonde', 'fr')

strip_protocol

Function removing the protocol from the url.

from ural import strip_protocol

strip_protocol('https://www2.lemonde.fr/index.php')
>>> 'www2.lemonde.fr/index.php'

Arguments

  • url string: URL to format.

urlpathsplit

Function taking a url and returning its path, tokenized as a list.

from ural import urlpathsplit

urlpathsplit('http://lemonde.fr/section/article.html')
>>> ['section', 'article.html']

urlpathsplit('http://lemonde.fr/')
>>> []

# If you want to split a path directly
from ural import pathsplit

pathsplit('/section/articles/')
>>> ['section', 'articles']

urls_from_html

Function returning an iterator over the urls present in the links of given HTML text.

from ural import urls_from_html

html = """<p>Hey! Check this site: <a href="https://medialab.sciencespo.fr/">médialab</a></p>"""

for url in urls_from_html(html):
    print(url)
>>> 'https://medialab.sciencespo.fr/'

Arguments

  • string string: html string.
  • base_url ?string: base url to join with relative urls.

urls_from_text

Function returning an iterator over the urls present in the string argument. Extracts only urls having a protocol.

Note that this function is somewhat markdown-aware, and punctuation-aware.

from ural import urls_from_text

text = "Hey! Check this site: https://medialab.sciencespo.fr/, it looks really cool. They're developing many tools on https://github.com/"

for url in urls_from_text(text):
    print(url)

>>> 'https://medialab.sciencespo.fr/'
>>> 'https://github.com/'

Arguments

  • string string: source string.

Upgrading suffixes and TLDs

If you want to upgrade the package's data wrt Mozilla suffixes and IANA TLDs, you can do so either by running the following command:

python -m ural upgrade

or directly in your python code:

from ural.tld import upgrade

upgrade()

# Or if you want to patch runtime only this time, or regularly
# (for long running programs or to avoid rights issues etc.):
upgrade(transient=True)

HostnameTrieSet

Class implementing a hierarchic set of hostnames so you can efficiently query whether urls match hostnames in the set.

from ural import HostnameTrieSet

trie = HostnameTrieSet()

trie.add('lemonde.fr')
trie.add('business.lefigaro.fr')

trie.match('https://liberation.fr/article1.html')
>>> False

trie.match('https://lemonde.fr/article1.html')
>>> True

trie.match('https://www.lemonde.fr/article1.html')
>>> True

trie.match('https://lefigaro.fr/article1.html')
>>> False

trie.match('https://business.lefigaro.fr/article1.html')
>>> True

#.add

Method add a single hostname to the set.

from ural import HostnameTrieSet

trie = HostnameTrieSet()
trie.add('lemonde.fr')

Arguments

  • hostname string: hostname to add to the set.

#.match

Method returning whether the given url matches any of the set's hostnames.

from ural import HostnameTrieSet

trie = HostnameTrieSet()
trie.add('lemonde.fr')

trie.match('https://liberation.fr/article1.html')
>>> False

trie.match('https://lemonde.fr/article1.html')
>>> True

Arguments

  • url string|urllib.parse.SplitResult: url to match.

lru.url_to_lru

Function converting the given url to a serialized lru.

from ural.lru import url_to_lru

url_to_lru('http://www.lemonde.fr:8000/article/1234/index.html?field=value#2')
>>> 's:http|t:8000|h:fr|h:lemonde|h:www|p:article|p:1234|p:index.html|q:field=value|f:2|'

Arguments

  • url string: url to convert.
  • suffix_aware ?bool: whether to be mindful of suffixes when converting (e.g. considering "co.uk" as a single token).

lru.lru_to_url

Function converting the given serialized lru or lru stems to a proper url.

from ural.lru import lru_to_url

lru_to_url('s:http|t:8000|h:fr|h:lemonde|h:www|p:article|p:1234|p:index.html|')
>>> 'http://www.lemonde.fr:8000/article/1234/index.html'

lru_to_url(['s:http', 'h:fr', 'h:lemonde', 'h:www', 'p:article', 'p:1234', 'p:index.html'])
>>> 'http://www.lemonde.fr:8000/article/1234/index.html'

lru.lru_stems

Function returning url parts in hierarchical order.

from ural.lru import lru_stems

lru_stems('http://www.lemonde.fr:8000/article/1234/index.html?field=value#2')
>>> ['s:http', 't:8000', 'h:fr', 'h:lemonde', 'h:www', 'p:article', 'p:1234', 'p:index.html', 'q:field=value', 'f:2']

Arguments

  • url string: URL to parse.
  • suffix_aware ?bool: whether to be mindful of suffixes when converting (e.g. considering "co.uk" as a single token).

lru.normalized_lru_stems

Function normalizing url and returning its parts in hierarchical order.

from ural.lru import normalized_lru_stems

normalized_lru_stems('http://www.lemonde.fr/article/1234/index.html?field=value#2')
>>> ['h:fr', 'h:lemonde', 'p:article', 'p:1234', 'q:field=value']

Arguments

This function accepts the same arguments as normalize_url.


lru.serialize_lru

Function serializing lru stems to a string.

from ural.lru import serialize_lru

serialize_lru(['s:https', 'h:fr', 'h:lemonde'])
>>> 's:https|h:fr|h:lemonde|'

lru.unserialize_lru

Function unserializing stringified lru to a list of stems.

from ural.lru import unserialize_lru

unserialize_lru('s:https|h:fr|h:lemonde|')
>>> ['s:https', 'h:fr', 'h:lemonde']

LRUTrie

Class implementing a prefix tree (Trie) storing URLs hierarchically by storing them as LRUs along with some arbitrary metadata. It is very useful when needing to match URLs by longest common prefix.

Note that this class directly inherits from the phylactery library's TrieDict so you can also use any of its methods.

from ural.lru import LRUTrie

trie = LRUTrie()

# To respect suffixes
trie = LRUTrie(suffix_aware=True)

#.set

Method storing a URL in a LRUTrie along with its metadata.

from ural.lru import LRUTrie

trie = LRUTrie()
trie.set('http://www.lemonde.fr', {'type': 'general press'})

trie.match('http://www.lemonde.fr')
>>> {'type': 'general press'}

Arguments

  • url string: url to store in the LRUTrie.
  • metadata any: metadata of the url.

#.set_lru

Method storing a URL already represented as a LRU or LRU stems along with its metadata.

from ural.lru import LRUTrie

trie = LRUTrie()

# Using stems
trie.set_lru(['s:http', 'h:fr', 'h:lemonde', 'h:www'], {'type': 'general press'})

# Using serialized lru
trie.set_lru('s:http|h:fr|h:lemonde|h:www|', {'type': 'general_press'})

Arguments

  • lru string|list: lru to store in the Trie.
  • metadata any: metadata to attach to the lru.

#.match

Method returning the metadata attached to the longest prefix match of your query URL. Will return None if no common prefix can be found.

from ural.lru import LRUTrie

trie = LRUTrie()
trie.set('http://www.lemonde.fr', {'media': 'lemonde'})

trie.match('http://www.lemonde.fr')
>>> {'media': 'lemonde'}
trie.match('http://www.lemonde.fr/politique')
>>> {'media': 'lemonde'}

trie.match('http://www.lefigaro.fr')
>>> None

Arguments

  • url string: url to match in the LRUTrie.

#.match_lru

Method returning the metadata attached to the longest prefix match of your query LRU. Will return None if no common prefix can be found.

from ural.lru import LRUTrie

trie = LRUTrie()
trie.set(['s:http', 'h:fr', 'h:lemonde', 'h:www'], {'media': 'lemonde'})

trie.match(['s:http', 'h:fr', 'h:lemonde', 'h:www'])
>>> {'media': 'lemonde'}
trie.match('s:http|h:fr|h:lemonde|h:www|p:politique|')
>>> {'media': 'lemonde'}

trie.match(['s:http', 'h:fr', 'h:lefigaro', 'h:www'])
>>> None

Arguments

  • lru string|list: lru to match in the LRUTrie.

NormalizedLRUTrie

The NormalizedLRUTrie is nearly identical to the standard LRUTrie except that it normalized urls given to it before attempting any operation. It is a good choice if you want to avoid prefix queries issues related to http vs https or www shenanigans, for instance.

To tweak its normalization, you can give to NormalizedLRUTrie the same options you would give to normalize_url:

from ural.lru import NormalizedLRUTrie

trie = NormalizedLRUTrie(normalize_amp=False)

Note that there are still some differences between the LRUTrie and the NormalizedLRUTrie:

  1. The NormalizedLRUTrie cannot be TLD aware.
  2. The NormalizedLRUTrie does not have the #.set_lru and #.match_lru methods.

Facebook

has_facebook_comments

Function returning whether the given url is pointing to a Facebook resource potentially having comments (such as a post, photo or video for instance).

from ural.facebook import has_facebook_comments

has_facebook_comments('https://www.facebook.com/permalink.php?story_fbid=1354978971282622&id=598338556946671')
>>> True

has_facebook_comments('https://www.facebook.com/108824017345866/videos/311658803718223')
>>> True

has_facebook_comments('https://www.facebook.com/astucerie/')
>>> False

has_facebook_comments('https://www.lemonde.fr')
>>> False

has_facebook_comments('/permalink.php?story_fbid=1354978971282622&id=598338556946671', allow_relative_urls=True)
>>> True

is_facebook_id

Function returning whether the given string is a valid Facebook id or not.

from ural.facebook import is_facebook_id

is_facebook_id('974583586343')
>>> True

is_facebook_id('whatever')
>>> False

is_facebook_full_id

Function returning whether the given string is a valid Facebook full post id or not.

from ural.facebook import is_facebook_full_id

is_facebook_full_id('974583586343_9749757953')
>>> True

is_facebook_full_id('974583586343')
>>> False

is_facebook_full_id('whatever')
>>> False

is_facebook_url

Function returning whether given url is from Facebook or not.

from ural.facebook import is_facebook_url

is_facebook_url('http://www.facebook.com/post/974583586343')
>>> True

is_facebook_url('https://fb.me/846748464')
>>> True

is_facebook_url('https://www.lemonde.fr')
>>> False

is_facebook_post_url

Function returning whether the given url is a Facebook post or not.

from ural.facebook import is_facebook_post_url

is_facebook_post_url('http://www.facebook.com/post/974583586343')
>>> True

is_facebook_post_url('http://www.facebook.com')
>>> False

is_facebook_post_url('https://www.lemonde.fr')
>>> False

is_facebook_link

Function returning whether the given url is a Facebook redirection link.

from ural.facebook import is_facebook_link

is_facebook_link('https://l.facebook.com/l.php?u=http%3A%2F%2Fwww.chaos-controle.com%2Farchives%2F2013%2F10%2F14%2F28176300.html&amp;h=AT0iUqJpUTMzHAH8HAXwZ11p8P3Z-SrY90wIXZhcjMnxBTHMiau8Fv1hvz00ZezRegqmF86SczyUXx3Gzdt_MdFH-I4CwHIXKKU9L6w522xwOqkOvLAylxojGEwrp341uC-GlVyGE2N7XwTPK9cpP0mQ8PIrWh8Qj2gHIIR08Js0mUr7G8Qe9fx66uYcfnNfTTF1xi0Us8gTo4fOZxAgidGWXsdgtU_OdvQqyEm97oHzKbWfXjkhsrzbtb8ZNMDwCP5099IMcKRD8Hi5H7W3vwh9hd_JlRgm5Z074epD_mGAeoEATE_QUVNTxO0SHO4XNn2Z7LgBamvevu1ENBcuyuSOYA0BsY2cx8mPWJ9t44tQcnmyQhBlYm_YmszDaQx9IfVP26PRqhsTLz-kZzv0DGMiJFU78LVWVPc9QSw2f9mA5JYWr29w12xJJ5XGQ6DhJxDMWRnLdG8Tnd7gZKCaRdqDER1jkO72u75-o4YuV3CLh4j-_4u0fnHSzHdVD8mxr9pNEgu8rvJF1E2H3-XbzA6F2wMQtFCejH8MBakzYtTGNvHSexSiKphE04Ci1Z23nBjCZFsgNXwL3wbIXWfHjh2LCKyihQauYsnvxp6fyioStJSGgyA9GGEswizHa20lucQF0S0F8H9-')
>>> True

is_facebook_link('https://lemonde.fr')
>>> False

convert_facebook_url_to_mobile

Function returning the mobile version of the given Facebook url. Will raise an exception if a non-Facebook url is given.

from ural.facebook import convert_facebook_url_to_mobile

convert_facebook_url_to_mobile('http://www.facebook.com/post/974583586343')
>>> 'http://m.facebook.com/post/974583586343'

parse_facebook_url

Function parsing the given Facebook url.

from ural.facebook import parse_facebook_url

# Importing related classes if you need to perform tests
from ural.facebook import (
  FacebookHandle,
  FacebookUser,
  FacebookGroup,
  FacebookPost,
  FacebookPhoto,
  FacebookVideo
)

parse_facebook_url('https://www.facebook.com/people/Sophia-Aman/102016783928989')
>>> FacebookUser(id='102016783928989')

parse_facebook_url('https://www.facebook.com/groups/159674260452951')
>>> FacebookGroup(id='159674260452951')

parse_facebook_url('https://www.facebook.com/groups/159674260852951/permalink/1786992671454427/')
>>> FacebookPost(id='1786992671454427', group_id='159674260852951')

parse_facebook_url('https://www.facebook.com/108824017345866/videos/311658803718223')
>>> FacebookVideo(id='311658803718223', parent_id='108824017345866')

parse_facebook_url('https://www.facebook.com/photo.php?fbid=10222721681573727')
>>> FacebookPhoto(id='10222721681573727')

parse_facebook_url('/annelaure.rivolu?rc=p&__tn__=R', allow_relative_urls=True)
>>> FacebookHandle(handle='annelaure.rivolu')

parse_facebook_url('https://lemonde.fr')
>>> None

extract_url_from_facebook_link

Function extracting target url from a Facebook redirection link.

from ural.facebook import extract_url_from_facebook_link

extract_url_from_facebook_link('https://l.facebook.com/l.php?u=http%3A%2F%2Fwww.chaos-controle.com%2Farchives%2F2013%2F10%2F14%2F28176300.html&amp;h=AT0iUqJpUTMzHAH8HAXwZ11p8P3Z-SrY90wIXZhcjMnxBTHMiau8Fv1hvz00ZezRegqmF86SczyUXx3Gzdt_MdFH-I4CwHIXKKU9L6w522xwOqkOvLAylxojGEwrp341uC-GlVyGE2N7XwTPK9cpP0mQ8PIrWh8Qj2gHIIR08Js0mUr7G8Qe9fx66uYcfnNfTTF1xi0Us8gTo4fOZxAgidGWXsdgtU_OdvQqyEm97oHzKbWfXjkhsrzbtb8ZNMDwCP5099IMcKRD8Hi5H7W3vwh9hd_JlRgm5Z074epD_mGAeoEATE_QUVNTxO0SHO4XNn2Z7LgBamvevu1ENBcuyuSOYA0BsY2cx8mPWJ9t44tQcnmyQhBlYm_YmszDaQx9IfVP26PRqhsTLz-kZzv0DGMiJFU78LVWVPc9QSw2f9mA5JYWr29w12xJJ5XGQ6DhJxDMWRnLdG8Tnd7gZKCaRdqDER1jkO72u75-o4YuV3CLh4j-_4u0fnHSzHdVD8mxr9pNEgu8rvJF1E2H3-XbzA6F2wMQtFCejH8MBakzYtTGNvHSexSiKphE04Ci1Z23nBjCZFsgNXwL3wbIXWfHjh2LCKyihQauYsnvxp6fyioStJSGgyA9GGEswizHa20lucQF0S0F8H9-')
>>> 'http://www.chaos-controle.com/archives/2013/10/14/28176300.html'

extract_url_from_facebook_link('http://lemonde.fr')
>>> None

Google

is_amp_url

Returns whether the given url is probably a Google AMP url.

from ural.google import is_amp_url

is_amp_url('http://www.europe1.fr/sante/les-onze-vaccins.amp')
>>> True

is_amp_url('https://www.lemonde.fr')
>>> False

is_google_link

Returns whether the given url is a Google search link.

from ural.google import is_google_link

is_google_link('https://www.google.com/url?sa=t&rct=j&q=&esrc=s&source=web&cd=4&cad=rja&uact=8&ved=2ahUKEwjp8Lih_LnmAhWQlxQKHVTmCJYQFjADegQIARAB&url=http%3A%2F%2Fwww.mon-ip.com%2F&usg=AOvVaw0sfeZJyVtUS2smoyMlJmes')
>>> True

is_google_link('https://www.lemonde.fr')
>>> False

extract_url_from_google_link

Extracts the url from the given Google search link. This is useful to "resolve" the links scraped from Google's search results. Returns None if given url is not valid nor relevant.

from ural.google import extract_url_from_google_link

extract_url_from_google_link('https://www.google.com/url?sa=t&rct=j&q=&esrc=s&source=web&cd=1&cad=rja&uact=8&ved=2ahUKEwicu4K-rZzmAhWOEBQKHRNWA08QFjAAegQIARAB&url=https%3A%2F%2Fwww.facebook.com%2Fieff.ogbeide&usg=AOvVaw0vrBVCiIHUr5pncjeLpPUp')

>>> 'https://www.facebook.com/ieff.ogbeide'

extract_url_from_google_link('https://www.lemonde.fr')
>>> None

extract_id_from_google_drive_url

Extracts a file id from the given Google drive url. Returns None if given url is not valid nor relevant.

from ural.google import extract_id_from_google_drive_url

extract_id_from_google_drive_url('https://docs.google.com/spreadsheets/d/1Q9sJtAb1BZhUMjxCLMrVASx3AoNDp5iV3VkbPjlg/edit#gid=0')
>>> '1Q9sJtAb1BZhUMjxCLMrVASx3AoNDp5iV3VkbPjlg'

extract_id_from_google_drive_url('https://www.lemonde.fr')
>>> None

parse_google_drive_url

Parse the given Google drive url. Returns None if given is not valid nor relevant.

from ural.google import (
  parse_google_drive_url,
  GoogleDriveFile,
  GoogleDrivePublicLink
)

parse_google_drive_url('https://docs.google.com/spreadsheets/d/1Q9sJtAb1BZhUMjxCLMrVASx3AoNDp5iV3VkbPjlg/edit#gid=0')
>>> GoogleDriveFile('spreadsheets', '1Q9sJtAb1BZhUMjxCLMrVASx3AoNDp5iV3VkbPjlg')

parse_google_drive_url('https://www.lemonde.fr')
>>> None

Instagram

is_instagram_post_shortcode

Function returning whether the given string is a valid Instagram post shortcode or not.

from ural.instagram import is_instagram_post_shortcode

is_instagram_post_shortcode('974583By-5_86343')
>>> True

is_instagram_post_shortcode('whatever!!')
>>> False

is_instagram_username

Function returning whether the given string is a valid Instagram username or not.

from ural.instagram import is_instagram_username

is_instagram_username('97458.3By-5_86343')
>>> True

is_instagram_username('whatever!!')
>>> False

is_instagram_url

Returns whether the given url is from Instagram.

from ural.instagram import is_instagram_url

is_instagram_url('https://lemonde.fr')
>>> False

is_instagram_url('https://www.instagram.com/guillaumelatorre')
>>> True

extract_username_from_instagram_url

Return a username from the given Instagram url or None if we could not find one.

from ural.instagram import extract_username_from_instagram_url

extract_username_from_instagram_url('https://www.instagram.com/martin_dupont/p/BxKRx5CHn5i/')
>>> 'martin_dupont'

extract_username_from_instagram_url('https://lemonde.fr')
>>> None

parse_instagram_url

Returns parsed information about the given Instagram url: either about the post, the user or the reel. If the url is an invalid Instagram url or if not an Instagram url, the function returns None.

from ural.instagram import (
  parse_instagram_url,

  # You can also import the named tuples if you need them
  InstagramPost,
  InstagramUser,
  InstagramReel
)

parse_instagram_url('https://www.instagram.com/martin_dupont/p/BxKRx5CHn5i/')
>>> InstagramPost(id='BxKRx5CHn5i', name='martin_dupont')

parse_instagram_url('https://lemonde.fr')
>>> None

parse_instagram_url('https://www.instagram.com/p/BxKRx5-Hn5i/')
>>> InstagramPost(id='BxKRx5-Hn5i', name=None)

parse_instagram_url('https://www.instagram.com/martin_dupont')
>>> InstagramUser(name='martin_dupont')

parse_instagram_url('https://www.instagram.com/reels/BxKRx5-Hn5i')
>>> InstagramReel(id='BxKRx5-Hn5i')

Arguments

  • url str: Instagram url to parse.

Telegram

is_telegram_message_id

Function returning whether the given string is a valid Telegram message id or not.

from ural.telegram import is_telegram_message_id

is_telegram_message_id('974583586343')
>>> True

is_telegram_message_id('whatever')
>>> False

is_telegram_url

Returns whether the given url is from Telegram.

from ural.telegram import is_telegram_url

is_telegram_url('https://lemonde.fr')
>>> False

is_telegram_url('https://telegram.me/guillaumelatorre')
>>> True

is_telegram_url('https://t.me/s/jesstern')
>>> True

convert_telegram_url_to_public

Function returning the public version of the given Telegram url. Will raise an exception if a non-Telegram url is given.

from ural.teglegram import convert_telegram_url_to_public

convert_telegram_url_to_public('https://t.me/jesstern')
>>> 'https://t.me/s/jesstern'

extract_channel_name_from_telegram_url

Return a channel from the given Telegram url or None if we could not find one.

from ural.telegram import extract_channel_name_from_telegram_url

extract_channel_name_from_telegram_url('https://t.me/s/jesstern/345')
>>> 'jesstern'

extract_channel_name_from_telegram_url('https://lemonde.fr')
>>> None

parse_telegram_url

Returns parsed information about the given telegram url: either about the channel, message or user. If the url is an invalid Telegram url or if not a Telegram url, the function returns None.

from ural.telegram import (
  parse_telegram_url,

  # You can also import the named tuples if you need them
  TelegramMessage,
  TelegramChannel,
  TelegramGroup
)

parse_telegram_url('https://t.me/s/jesstern/76')
>>> TelegramMessage(name='jesstern', id='76')

parse_telegram_url('https://lemonde.fr')
>>> None

parse_telegram_url('https://telegram.me/rapsocialclub')
>>> TelegramChannel(name='rapsocialclub')

parse_telegram_url('https://t.me/joinchat/AAAAAE9B8u_wO9d4NiJp3w')
>>> TelegramGroup(id='AAAAAE9B8u_wO9d4NiJp3w')

Arguments

  • url str: Telegram url to parse.

Twitter

is_twitter_url

Returns whether the given url is from Twitter.

from ural.twitter import is_twitter_url

is_twitter_url('https://lemonde.fr')
>>> False

is_twitter_url('https://www.twitter.com/Yomguithereal')
>>> True

is_twitter_url('https://twitter.com')
>>> True

extract_screen_name_from_twitter_url

Extracts a normalized user's screen name from a Twitter url. If given an irrelevant url, the function will return None.

from ural.twitter import extract_screen_name_from_twitter_url

extract_screen_name_from_twitter_url('https://www.twitter.com/Yomguithereal')
>>> 'yomguithereal'

extract_screen_name_from_twitter_url('https://twitter.fr')
>>> None

parse_twitter_url

Takes a Twitter url and returns either a TwitterUser namedtuple (contains a screen_name) if the given url is a link to a twitter user, a TwitterTweet namedtuple (contains a user_screen_name and an id) if the given url is a tweet's url, a TwitterList namedtuple (contains an id) or None if the given url is irrelevant.

from ural.twitter import parse_twitter_url

parse_twitter_url('https://twitter.com/Yomguithereal')
>>> TwitterUser(screen_name='yomguithereal')

parse_twitter_url('https://twitter.com/medialab_ScPo/status/1284154793376784385')
>>> TwitterTweet(user_screen_name='medialab_scpo', id='1284154793376784385')

parse_twitter_url('https://twitter.com/i/lists/15512656222798157826')
>>> TwitterList(id='15512656222798157826')

parse_twitter_url('https://twitter.com/home')
>>> None

Youtube

is_youtube_url

Returns whether the given url is from Youtube.

from ural.youtube import is_youtube_url

is_youtube_url('https://lemonde.fr')
>>> False

is_youtube_url('https://www.youtube.com/watch?v=otRTOE9i51o')
>>> True

is_youtube_url('https://youtu.be/otRTOE9i51o)
>>> True

is_youtube_video_id

Returns whether the given string is a formally valid Youtube id. Note that it won't validate the fact that this id actually refers to an existing video or not. You will need to call Youtube servers for that.

from ural.youtube import is_youtube_video_id

is_youtube_video_id('otRTOE9i51o')
>>> True

is_youtube_video_id('bDYTYET')
>>> False

parse_youtube_url

Returns parsed information about the given youtube url: either about the linked video, user or channel. If the url is an invalid Youtube url or if not a Youtube url, the function returns None.

from ural.youtube import (
  parse_youtube_url,

  # You can also import the named tuples if you need them
  YoutubeVideo,
  YoutubeUser,
  YoutubeChannel
)

parse_youtube_url('https://www.youtube.com/watch?v=otRTOE9i51o')
>>> YoutubeVideo(id='otRTOE9i51o')

parse_youtube_url('https://lemonde.fr')
>>> None

parse_youtube_url('http://www.youtube.com/channel/UCWvUxN9LAjJ-sTc5JJ3gEyA/videos')
>>> YoutubeChannel(id='UCWvUxN9LAjJ-sTc5JJ3gEyA', name=None)

parse_youtube_url('http://www.youtube.com/user/ojimfrance')
>>> YoutubeUser(id=None, name='ojimfrance')

parse_youtube_url('https://www.youtube.com/taranisnews')
>>> YoutubeChannel(id=None, name='taranisnews')

Arguments

  • url str: Youtube url to parse.
  • fix_common_mistakes bool [True]: Whether to fix common mistakes that can be found in Youtube urls as you can find them when crawling the web.

extract_video_id_from_youtube_url

Return a video id from the given Youtube url or None if we could not find one.

from ural.youtube import extract_video_id_from_youtube_url

extract_video_id_from_youtube_url('https://www.youtube.com/watch?v=otRTOE9i51o')
>>> 'otRTOE9i51o'

extract_video_id_from_youtube_url('https://lemonde.fr')
>>> None

extract_video_id_from_youtube_url('http://youtu.be/afa-5HQHiAs')
>>> 'afa-5HQHiAs'

normalize_youtube_url

Returns a normalized version of the given Youtube url. It will normalize video, user and channel urls so you can easily match them.

from ural.youtube import normalize_youtube_url

normalize_youtube_url('https://www.youtube.com/watch?v=otRTOE9i51o')
>>> 'https://www.youtube.com/watch?v=otRTOE9i51o'

normalize_youtube_url('http://youtu.be/afa-5HQHiAs')
>>> 'https://www.youtube.com/watch?v=afa-5HQHiAs'

Miscellaneous

About LRUs

TL;DR: a LRU is a hierarchical reordering of a URL so that one can perform meaningful prefix queries on URLs.

If you observe many URLs, you will quickly notice that they are not written in sound hierarchical order. In this URL, for instance:

http://business.lemonde.fr/articles/money.html?id=34#content

Some parts, such as the subdomain, are written in an "incorrect order". And this is fine, really, this is how URLs always worked.

But if what you really want is to match URLs, you will need to reorder them so that their order closely reflects the hierarchy of their targeted content. And this is exactly what LRUs are (that and also a bad pun on URL, since a LRU is basically a "reversed" URL).

Now look how the beforementioned URL could be splitted into LRU stems:

[
  's:http',
  'h:fr',
  'h:lemonde',
  'h:business',
  'p:articles',
  'p:money.html',
  'q:id=34',
  'f:content'
]

And typically, this list of stems will be serialized thusly:

s:http|h:fr|h:lemonde|h:business|p:articles|p:money.html|q:id=34|f:content|

The trailing slash is added so that serialized LRUs can be prefix-free.

Project details


Release history Release notifications | RSS feed

Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

ural-0.42.0.tar.gz (115.3 kB view details)

Uploaded Source

Built Distribution

ural-0.42.0-py3-none-any.whl (104.7 kB view details)

Uploaded Python 3

File details

Details for the file ural-0.42.0.tar.gz.

File metadata

  • Download URL: ural-0.42.0.tar.gz
  • Upload date:
  • Size: 115.3 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/1.11.0 pkginfo/1.9.6 requests/2.28.2 setuptools/47.1.0 requests-toolbelt/0.10.1 tqdm/4.31.1 CPython/3.7.16

File hashes

Hashes for ural-0.42.0.tar.gz
Algorithm Hash digest
SHA256 eaed5ba826f9f8406d85d6595cb25055c4e936a72261749987486c69f7eaa414
MD5 f8eebd52eeddf152f87041ba57e04844
BLAKE2b-256 c544ddcd4f7d46e4e1968c28d57d7df4efb51fd7ed60aa85deb07f5092afecee

See more details on using hashes here.

File details

Details for the file ural-0.42.0-py3-none-any.whl.

File metadata

  • Download URL: ural-0.42.0-py3-none-any.whl
  • Upload date:
  • Size: 104.7 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/1.11.0 pkginfo/1.9.6 requests/2.28.2 setuptools/47.1.0 requests-toolbelt/0.10.1 tqdm/4.31.1 CPython/3.7.16

File hashes

Hashes for ural-0.42.0-py3-none-any.whl
Algorithm Hash digest
SHA256 7d349a10a35301f543670f28298201a9895a689c5cca2f63f570fd5699db9295
MD5 59d12c5a14a15df2c8347ec1168b1e05
BLAKE2b-256 00eedba3e915ff17f6d18ccf8693fd87a3c52753ce6e64e0e320d360b0c694b7

See more details on using hashes here.

Supported by

AWS AWS Cloud computing and Security Sponsor Datadog Datadog Monitoring Fastly Fastly CDN Google Google Download Analytics Microsoft Microsoft PSF Sponsor Pingdom Pingdom Monitoring Sentry Sentry Error logging StatusPage StatusPage Status page