An experimental diff library for generating operation deltas that represent the difference between two sequences of comparable items.

Project description

Deltas

An MIT-licensed library for generating deltas (a.k.a. sequences of operations) representing the difference between two sequences of comparable tokens.

This library is intended to make experimental difference detection strategies more easily available. Two strategies are currently available:

deltas.sequence_matcher.diff(a, b):
A shameless wrapper around difflib.SequenceMatcher to get it to work within the structure of deltas.

deltas.segment_matcher.diff(a, b, segmenter=None):
A generalized difference detector that is designed to detect block moves and copies based on the use of a Segmenter.

Example:

from deltas import segment_matcher, text_split
a = text_split.tokenize("This is some text. This is some other text.")
b = text_split.tokenize("This is some other text. This is some text.")
operations = segment_matcher.diff(a, b)

for op in operations:
    print(op.name, repr(''.join(a[op.a1:op.a2])),
          repr(''.join(b[op.b1:op.b2])))

equal 'This is some other text.' 'This is some other text.'
insert ' ' ' '
equal 'This is some text.' 'This is some text.'
delete ' ' ''
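
For comparison, the difflib-backed strategy takes the same arguments and yields operations with the same fields. Since it does not detect block moves, the reordered sentence is expected to surface as a delete/insert pair rather than as two equal segments. A sketch reusing the inputs above:

from deltas import sequence_matcher, text_split

a = text_split.tokenize("This is some text. This is some other text.")
b = text_split.tokenize("This is some other text. This is some text.")

# Same Operation fields (name, a1, a2, b1, b2) as in the example above
for op in sequence_matcher.diff(a, b):
    print(op.name, repr(''.join(a[op.a1:op.a2])),
          repr(''.join(b[op.b1:op.b2])))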

Tokenization

By default, Deltas performs tokenization by regular-expression text splitting. CJK tokenization functionality is also included: if at least 1/4 of the text (the default threshold) consists of Japanese or Korean characters, the text is tokenized by the corresponding language-specific tokenizer; otherwise, the Chinese tokenizer is used. (The selection rule is sketched after the list below.)

  • Chinese Tokenizer - Jieba
  • Japanese Tokenizer - Sudachi
  • Korean Tokenizer - KoNLPy (Okt)
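
The following is a hypothetical, minimal sketch of that threshold rule. The function name, character ranges, and structure are illustrative assumptions, not the library's actual implementation:

import re

# Illustrative sketch of the selection rule described above; names and
# character ranges are assumptions, not deltas' actual code.
JA_CHARS = re.compile('[\u3040-\u30ff]')  # Hiragana + Katakana
KO_CHARS = re.compile('[\uac00-\ud7af]')  # Hangul syllables

def pick_cjk_tokenizer(text, threshold=0.25):
    n = max(len(text), 1)
    if len(JA_CHARS.findall(text)) / n >= threshold:
        return "japanese"  # Sudachi
    if len(KO_CHARS.findall(text)) / n >= threshold:
        return "korean"    # KoNLPy (Okt)
    return "chinese"       # Jieba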

Tokenization example:

import mwapi
import deltas.tokenizers

# example titles: "中国" ("China", Chinese/zh), "俳句" ("Haiku", Japanese/ja), "김치" ("Kimchi", Korean/ko)
session = mwapi.Session("https://zh.wikipedia.org")
doc = session.get(action="query", prop="revisions", titles="中国",
                  rvprop="content", rvslots="main", formatversion=2)
text = doc['query']['pages'][0]['revisions'][0]['slots']['main']['content']

# text processed only by the regexp tokenizer
tokenized_text = deltas.tokenizers.wikitext_split.tokenize(text)
# text processed by the regexp tokenizer with CJK post-processing
tokenized_text_cjk = deltas.tokenizers.wikitext_split_w_cjk.tokenize(text)
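
To see the effect of the CJK post-processing, you can compare the first few tokens of each result (a sketch; it assumes only that the returned tokens stringify to their text):

# Print the first few tokens from each tokenization for comparison
print([str(t) for t in tokenized_text[:10]])
print([str(t) for t in tokenized_text_cjk[:10]])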

For improved Japanese tokenizer accuracy, please install the full dictionary:

pip install sudachidict_full
# and link SudachiPy to the full dictionary
sudachipy link -t full

Download files

Download the file for your platform. If you're not sure which to choose, see the Python Packaging User Guide for more on installing packages.
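
In most cases you can skip the manual download and install the latest release from PyPI with pip:

pip install deltas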

Source Distribution

deltas-0.6.2.tar.gz (25.9 kB)

Uploaded: Source

Built Distribution

deltas-0.6.2-py2.py3-none-any.whl (35.3 kB)

Uploaded: Python 2, Python 3

File details

Details for the file deltas-0.6.2.tar.gz.

File metadata

  • Download URL: deltas-0.6.2.tar.gz
  • Upload date:
  • Size: 25.9 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/1.15.0 pkginfo/1.6.1 requests/2.25.1 setuptools/49.6.0 requests-toolbelt/0.9.1 tqdm/4.55.1 CPython/3.5.10

File hashes

Hashes for deltas-0.6.2.tar.gz

  • SHA256: 0f4c6ff40e4a85ac14e7c388cfcebe5054492fe7c1a5763c841c3ff5437e81f2
  • MD5: f9981b755b782c466d40b037d08dfcc1
  • BLAKE2b-256: 8697836e2992c9a4dcb129792079e4f08e312d7ad3297de24dc43b1736673271

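If you download the sdist manually, you can check it against the published SHA256 digest above. A minimal sketch using Python's standard hashlib (the path assumes the file is in the current directory):

import hashlib

# Compare the downloaded file's SHA256 against the published digest
expected = "0f4c6ff40e4a85ac14e7c388cfcebe5054492fe7c1a5763c841c3ff5437e81f2"
with open("deltas-0.6.2.tar.gz", "rb") as f:
    actual = hashlib.sha256(f.read()).hexdigest()
assert actual == expected, "hash mismatch"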

File details

Details for the file deltas-0.6.2-py2.py3-none-any.whl.

File metadata

  • Download URL: deltas-0.6.2-py2.py3-none-any.whl
  • Upload date:
  • Size: 35.3 kB
  • Tags: Python 2, Python 3
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/1.15.0 pkginfo/1.6.1 requests/2.25.1 setuptools/49.6.0 requests-toolbelt/0.9.1 tqdm/4.55.1 CPython/3.5.10

File hashes

Hashes for deltas-0.6.2-py2.py3-none-any.whl

  • SHA256: ac29035c519f7ff2a446b20c6c1693a7dcf676922d4ca1f61a194eb226fd79b9
  • MD5: 8e865a00a50c3638bc5a38ae85af50a6
  • BLAKE2b-256: 67d6259b3ecc7536b3ce8da89bf315c9341c8fe063f9f979e829ab9ce8a42790

