Fog

A fuzzy matching/clustering library for Python.

Installation

You can install fog with pip with the following command:

pip install fog

Usage

Graph

floatsam_sparsification

Function using an iterative algorithm to find the best weight threshold at which to trim the given graph's edges while preserving its underlying community structure.

It works by iteratively increasing the threshold and stopping as soon as a significant connected component starts to drift away from the principal one.

This is basically a very naive gradient descent with a very naive cost function but it works decently for typical cases.

Arguments

  • graph nx.Graph: Graph to sparsify.
  • starting_treshold float: Starting similarity threshold.
  • learning_rate ?float [0.05]: How much to increase the threshold at each step of the algorithm.
  • max_drifter_size ?int: Maximum size of a component allowed to detach itself from the principal one before the algorithm stops. If not provided, it defaults to the logarithm of the graph's total number of nodes.
  • weight ?str [weight wrt networkx conventions]: Name of the weight attribute.
  • remove_edges ?bool [False]: Whether to remove from the graph the edges whose weight is less than the found threshold. Note that if True, this will mutate the given graph.
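The iterative search described above can be sketched in plain Python, without networkx. This is a simplified illustration, not the library's implementation: the function names, the edge-list input, the stopping rule (the second-largest component reaching max_drifter_size) and the returned value (the last threshold tried before a drifter appears) are all assumptions.

```python
import math
from collections import defaultdict


def component_sizes(nodes, edges, threshold):
    """Component sizes, largest first, keeping only edges with weight >= threshold."""
    adjacency = defaultdict(set)
    for u, v, w in edges:
        if w >= threshold:
            adjacency[u].add(v)
            adjacency[v].add(u)

    seen, sizes = set(), []
    for node in nodes:
        if node in seen:
            continue
        seen.add(node)
        stack, size = [node], 0
        while stack:  # depth-first traversal of one component
            current = stack.pop()
            size += 1
            for neighbor in adjacency[current]:
                if neighbor not in seen:
                    seen.add(neighbor)
                    stack.append(neighbor)
        sizes.append(size)
    return sorted(sizes, reverse=True)


def sketch_sparsification_threshold(nodes, edges, start=0.0, learning_rate=0.05,
                                    max_drifter_size=None):
    """Hypothetical helper: raise the threshold until a sizeable component drifts away."""
    if max_drifter_size is None:
        max_drifter_size = math.log(len(nodes))
    max_weight = max(w for _, _, w in edges)
    best, step = start, 0
    while start + step * learning_rate <= max_weight:
        threshold = start + step * learning_rate
        sizes = component_sizes(nodes, edges, threshold)
        # Assumed stopping rule: the second-largest component is the "drifter".
        if len(sizes) > 1 and sizes[1] >= max_drifter_size:
            break
        best = threshold
        step += 1
    return best


# Two tight triangles joined by a single weak bridge.
nodes = ['a1', 'a2', 'a3', 'b1', 'b2', 'b3']
edges = [
    ('a1', 'a2', 0.9), ('a2', 'a3', 0.9), ('a1', 'a3', 0.9),
    ('b1', 'b2', 0.9), ('b2', 'b3', 0.9), ('b1', 'b3', 0.9),
    ('a3', 'b1', 0.32),
]
best_threshold = sketch_sparsification_threshold(nodes, edges)
```

In this toy graph the search stops once the threshold exceeds the bridge's weight, since cutting the bridge would detach a whole triangle from the principal component.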

monopartite_projection

Function computing a monopartite projection of the given bipartite graph. This projection can be basic and create a weighted edge each time two nodes in the target partition share a common neighbor. Alternatively, it can be weighted and filtered using a similarity metric such as Jaccard or cosine similarity.

Arguments

  • bipartite nx.Graph: Target bipartite graph.
  • project str: Name of the partition to project.
  • part ?str [bipartite]: Name of the node attribute on which the graph partition is built e.g. "color" or "type" etc.
  • weight ?str [weight]: Name of the weight edge attribute.
  • metric ?str [None]: Metric to use. If None, the basic projection will be returned. Also accepts jaccard, overlap, dice, cosine or binary_cosine.
  • threshold ?float [None]: Optional similarity threshold under which edges won't be added to the monopartite projection.
  • use_topology ?bool: Whether to use the bipartite graph's topology to attempt a subquadratic time projection. Intuitively, this works by not computing similarities of all pairs of nodes but only of pairs of nodes that share at least one common neighbor. It is generally faster than the quadratic approach but can sometimes hurt performance by wasting time on graph traversals when the graph is very dense.
  • bipartition_check ?bool: Whether to start by checking that the graph is bipartite, because the function can get stuck in an infinite loop if the given graph is not truly bipartite. Disable this kwarg for better performance if you know beforehand that your graph is bipartite.
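To illustrate the topology-based idea behind use_topology, here is a minimal sketch in plain Python. It is not the library's code: the function name, the dict-of-sets graph representation and the choice of an unweighted Jaccard similarity are assumptions made for the example.

```python
def sketch_projection(neighbors, part, threshold=None):
    """Hypothetical helper.

    neighbors: dict mapping each node to the set of its neighbors.
    part: nodes of the partition to project.
    """
    part = set(part)

    # Only consider pairs reachable through a shared neighbor,
    # instead of every possible pair in the partition.
    candidates = set()
    for node in part:
        for pivot in neighbors[node]:
            for other in neighbors[pivot]:
                if other in part and other != node:
                    candidates.add(frozenset((node, other)))

    projection = {}
    for pair in candidates:
        u, v = sorted(pair)
        a, b = neighbors[u], neighbors[v]
        similarity = len(a & b) / len(a | b)  # Jaccard on neighborhoods
        # Edges under the threshold are not added to the projection.
        if threshold is None or similarity >= threshold:
            projection[(u, v)] = similarity
    return projection


# People on one side, groups on the other.
neighbors = {
    'alice': {'g1', 'g2'}, 'bob': {'g1'}, 'carol': {'g3'},
    'g1': {'alice', 'bob'}, 'g2': {'alice'}, 'g3': {'carol'},
}
people = ['alice', 'bob', 'carol']
projected = sketch_projection(neighbors, people)
filtered = sketch_projection(neighbors, people, threshold=0.6)
```

Here carol is never compared to anyone because she shares no group with the others, which is where the time saving comes from on sparse graphs.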

Keyers

omission_key

Function returning a string's omission key which is constructed thusly:

  1. First we record the string's set of consonants in an order where the most frequently misspelled consonants come last.
  2. Then we record the string's set of vowels in the order of first appearance.

This key is very useful when searching for misspelled strings because, when sorted using this key, similar strings will be next to each other.

Arguments

  • string str: The string to encode.
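The two steps above can be sketched as follows. This is an illustration rather than the library's implementation: the function name is hypothetical, and the consonant ordering used here is the one commonly attributed to Pollock and Zamora's omission key, which is an assumption about what the library uses. Y is treated as a consonant in this sketch.

```python
# Assumed ordering: consonants most often omitted in misspellings come last.
CONSONANT_ORDER = 'JKQXZVWYBFMGPDHCLNTSR'
VOWELS = set('AEIOU')


def sketch_omission_key(string):
    letters = [c for c in string.upper() if c.isalpha()]
    present = set(letters)

    # 1. Unique consonants, sorted by the fixed ordering above.
    consonants = [c for c in CONSONANT_ORDER if c in present]

    # 2. Unique vowels, in order of first appearance.
    vowels = []
    for c in letters:
        if c in VOWELS and c not in vowels:
            vowels.append(c)

    return ''.join(consonants + vowels)


key = sketch_omission_key('caramel')
misspelled_key = sketch_omission_key('carramel')
```

Both spellings produce the same key in this sketch, so they would sit next to each other once sorted.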

skeleton_key

Function returning a string's skeleton key which is constructed thusly:

  1. The first letter of the string
  2. Unique consonants in order of appearance
  3. Unique vowels in order of appearance

This key is very useful when searching for misspelled strings because, when sorted using this key, similar strings will be next to each other.

Arguments

  • string str: The string to encode.
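A minimal sketch of the three-part construction above, in plain Python. The function name is hypothetical, and details such as uppercasing, ignoring non-letters and treating Y as a consonant are assumptions, not confirmed library behavior.

```python
VOWELS = set('AEIOU')


def sketch_skeleton_key(string):
    letters = [c for c in string.upper() if c.isalpha()]
    if not letters:
        return ''

    first = letters[0]
    consonants, vowels = [], []
    for c in letters[1:]:
        if c == first:
            continue  # the first letter is only recorded once
        bucket = vowels if c in VOWELS else consonants
        if c not in bucket:
            bucket.append(c)

    # First letter, then unique consonants, then unique vowels.
    return first + ''.join(consonants) + ''.join(vowels)


key = sketch_skeleton_key('chemogenic')
```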

Metrics

cosine_similarity

Function computing the cosine similarity of the given sequences. Runs in O(n), n being the sum of A & B's sizes.

Arguments

  • A iterable: First sequence.
  • B iterable: Second sequence.
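As a rough illustration of the linear-time computation, the sketch below counts the items of each sequence and then takes the cosine of the two count vectors. Treating the sequences as bags of items is an assumption about the library's behavior, and the function name is hypothetical.

```python
import math
from collections import Counter


def sketch_cosine_similarity(A, B):
    a, b = Counter(A), Counter(B)

    # A Counter returns 0 for missing items, so unshared items add nothing.
    dot = sum(count * b[item] for item, count in a.items())
    norm_a = math.sqrt(sum(count * count for count in a.values()))
    norm_b = math.sqrt(sum(count * count for count in b.values()))

    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


score = sketch_cosine_similarity('context', 'contact')
```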

sparse_cosine_similarity

Function computing cosine similarity on sparse weighted sets represented as Python dicts.

Runs in O(n), n being the sum of A & B's sizes.

from fog.metrics import sparse_cosine_similarity

# Basic
sparse_cosine_similarity({'apple': 34, 'pear': 3}, {'pear': 1, 'orange': 1})
>>> ~0.062

Arguments

  • A Counter: First weighted set.
  • B Counter: Second weighted set.
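The figure in the example above can be checked by hand. The following lines only reproduce the arithmetic with the standard library and do not call fog itself.

```python
import math

A = {'apple': 34, 'pear': 3}
B = {'pear': 1, 'orange': 1}

# Only 'pear' is shared, so the dot product is 3 * 1.
dot = sum(weight * B[key] for key, weight in A.items() if key in B)

norm_a = math.sqrt(sum(w * w for w in A.values()))  # sqrt(34^2 + 3^2)
norm_b = math.sqrt(sum(w * w for w in B.values()))  # sqrt(1^2 + 1^2)

similarity = dot / (norm_a * norm_b)
```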

sparse_dot_product

Function used to compute the dot product of sparse weighted sets represented by Python dicts.

Runs in O(n), n being the size of the smaller set.

Arguments

  • A Counter: First weighted set.
  • B Counter: Second weighted set.
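The stated complexity follows from iterating over the smaller dict only and probing the larger one. A minimal sketch of that idea, with a hypothetical function name:

```python
def sketch_sparse_dot_product(A, B):
    # Always iterate over the smaller mapping.
    if len(A) > len(B):
        A, B = B, A
    return sum(weight * B[key] for key, weight in A.items() if key in B)


product = sketch_sparse_dot_product(
    {'apple': 34, 'pear': 3, 'plum': 2},
    {'pear': 1, 'orange': 1},
)
```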

binary_cosine_similarity

Function computing the binary cosine similarity of the given sequences. Runs in O(n), n being the size of the smaller set.

Arguments

  • A iterable: First sequence.
  • B iterable: Second sequence.
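With binary weights, the cosine similarity reduces to the size of the intersection divided by the square root of the product of both set sizes. A small sketch under that definition, with a hypothetical function name:

```python
import math


def sketch_binary_cosine_similarity(A, B):
    a, b = set(A), set(B)
    if not a or not b:
        return 0.0
    return len(a & b) / math.sqrt(len(a) * len(b))


score = sketch_binary_cosine_similarity('context', 'contact')
```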

sparse_binary_cosine_similarity

Function computing binary cosine similarity on sparse vectors represented as Python sets.

Runs in O(n), n being the size of the smaller set.

Arguments

  • A Counter: First set.
  • B Counter: Second set.

dice_coefficient

Function computing the Dice coefficient, that is to say twice the size of the intersection of both sets divided by the sum of their sizes.

Runs in O(n), n being the size of the smaller set.

from fog.metrics import dice_coefficient

# Basic
dice_coefficient('context', 'contact')
>>> ~0.727

Arguments

  • A iterable: First sequence.
  • B iterable: Second sequence.
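The value in the example above can be reproduced by hand from the definition. This sketch uses only built-in sets and does not call fog; the function name is hypothetical.

```python
def sketch_dice_coefficient(A, B):
    a, b = set(A), set(B)
    if not a and not b:
        return 0.0
    return 2 * len(a & b) / (len(a) + len(b))


# 'context' has 6 distinct letters, 'contact' has 5, and they share 4.
score = sketch_dice_coefficient('context', 'contact')
```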

jaccard_similarity

Function computing the Jaccard similarity, that is to say the size of the intersection of the input sets divided by the size of their union.

Runs in O(n), n being the size of the smaller set.

from fog.metrics import jaccard_similarity

# Basic
jaccard_similarity('context', 'contact')
>>> ~0.571

Arguments

  • A iterable: First sequence.
  • B iterable: Second sequence.
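One way to stay linear in the smaller set is to derive the union's size from the intersection instead of building the union. A sketch of that trick, with a hypothetical function name:

```python
def sketch_jaccard_similarity(A, B):
    a, b = set(A), set(B)
    if not a and not b:
        return 0.0
    intersection = len(a & b)
    # |A ∪ B| = |A| + |B| - |A ∩ B|, so the union is never materialized.
    return intersection / (len(a) + len(b) - intersection)


score = sketch_jaccard_similarity('context', 'contact')
```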

weighted_jaccard_similarity

Function computing the weighted Jaccard similarity. Runs in O(n), n being the sum of A & B's sizes.

from fog.metrics import weighted_jaccard_similarity

# Basic
weighted_jaccard_similarity({'apple': 34, 'pear': 3}, {'pear': 1, 'orange': 1})
>>> ~0.026

Arguments

  • A Counter: First weighted set.
  • B Counter: Second weighted set.
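The weighted variant is usually defined as the sum of the per-key minimum weights divided by the sum of the per-key maximum weights, a missing key counting as 0. The sketch below assumes that definition, which is consistent with the example above, and uses a hypothetical function name.

```python
def sketch_weighted_jaccard_similarity(A, B):
    keys = set(A) | set(B)
    minimums = sum(min(A.get(k, 0), B.get(k, 0)) for k in keys)
    maximums = sum(max(A.get(k, 0), B.get(k, 0)) for k in keys)
    if maximums == 0:
        return 0.0
    return minimums / maximums


# Minimums: only 'pear' contributes (1). Maximums: 34 + 3 + 1 = 38.
score = sketch_weighted_jaccard_similarity(
    {'apple': 34, 'pear': 3},
    {'pear': 1, 'orange': 1},
)
```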

overlap_coefficient

Function computing the overlap coefficient of the given sets, i.e. the size of their intersection divided by the size of the smaller set.

Runs in O(n), n being the size of the smaller set.

Arguments

  • A iterable: First sequence.
  • B iterable: Second sequence.
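A short sketch of the definition above, using built-in sets only. The function name is hypothetical, and returning 0.0 for an empty input is an assumption.

```python
def sketch_overlap_coefficient(A, B):
    a, b = set(A), set(B)
    if not a or not b:
        return 0.0
    return len(a & b) / min(len(a), len(b))


# 4 shared letters, and the smaller set ('contact') has 5 distinct letters.
score = sketch_overlap_coefficient('context', 'contact')
```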
