An experimental diff library for generating operation deltas that represent the difference between two sequences of comparable items.
Project description
Deltas
An MIT-licensed library for generating deltas (a.k.a. sequences of operations) representing the difference between two sequences of comparable tokens.
- Installation:
pip install deltas
- Repo: http://github.com/halfak/Deltas
- Documentation: http://pythonhosted.org/deltas
- Note: this library requires Python 3.3 or newer
This library is intended to make experimental difference detection strategies easily available. There are currently two strategies:
deltas.sequence_matcher.diff(a, b):
    A shameless wrapper around difflib.SequenceMatcher to get it to work within the structure of deltas (see the comparison sketch after the example below).

deltas.segment_matcher.diff(a, b, segmenter=None):
    A generalized difference detector that is designed to detect block moves and copies based on the use of a Segmenter.
Example:
from deltas import segment_matcher, text_split

a = text_split.tokenize("This is some text. This is some other text.")
b = text_split.tokenize("This is some other text. This is some text.")
operations = segment_matcher.diff(a, b)

for op in operations:
    print(op.name, repr(''.join(a[op.a1:op.a2])),
          repr(''.join(b[op.b1:op.b2])))

This prints:

equal 'This is some other text.' 'This is some other text.'
insert ' ' ' '
equal 'This is some text.' 'This is some text.'
delete ' ' ''
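For comparison, the same tokens can be run through the difflib-backed strategy. A minimal sketch, assuming sequence_matcher is importable from the deltas package as the dotted path above suggests; since difflib.SequenceMatcher does not model block moves, the relocated sentence is expected to surface as delete/insert pairs rather than as matched segments:

from deltas import sequence_matcher, text_split

a = text_split.tokenize("This is some text. This is some other text.")
b = text_split.tokenize("This is some other text. This is some text.")

# Same Operation fields (name, a1, a2, b1, b2) as segment_matcher.diff,
# but computed by difflib.SequenceMatcher under the hood.
for op in sequence_matcher.diff(a, b):
    print(op.name, repr(''.join(a[op.a1:op.a2])),
          repr(''.join(b[op.b1:op.b2])))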
Tokenization
By default, Deltas performs tokenization by regexp-based text splitting. CJK tokenization functionality is also included: if at least 1/4 (the default threshold) of the text's symbols are Japanese or Korean, the text is tokenized by the corresponding language-specific tokenizer; otherwise, the Chinese tokenizer is used.
- Chinese Tokenizer - Jieba
- Japanese Tokenizer - Sudachi
- Korean Tokenizer - KoNLPy(Okt)
Tokenization example:
import mwapi
import deltas
import deltas.tokenizers
# Example titles: "中国" ("China", Chinese/zh), "俳句" ("Haiku", Japanese/ja), "김치" ("Kimchi", Korean/ko)
session = mwapi.Session("https://zh.wikipedia.org")
doc = session.get(action="query", prop="revisions", titles="中国",
                  rvprop="content", rvslots="main", formatversion=2)
text = doc['query']['pages'][0]['revisions'][0]['slots']['main']['content']
# Text processed by the regexp tokenizer only
tokenized_text = deltas.tokenizers.wikitext_split.tokenize(text)
# Text processed by the regexp tokenizer with CJK post-processing
tokenized_text_cjk = deltas.tokenizers.wikitext_split_w_cjk.tokenize(text)
For improved Japanese tokenizer accuracy, install the full dictionary:

pip install sudachidict_full

# and link Sudachi to the full dictionary
sudachipy link -t full
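As a quick sanity check, the two token streams can be compared directly. A minimal sketch, assuming the tokenize calls above return list-like sequences of string tokens (exact token counts depend on the text and the installed dictionaries):

# Inspect how CJK post-processing re-segments runs of CJK characters.
print(len(tokenized_text), len(tokenized_text_cjk))
print(tokenized_text[:10])
print(tokenized_text_cjk[:10])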
Project details
Download files
Download the file for your platform.
- Source Distribution: deltas-0.6.0.tar.gz
- Built Distribution: deltas-0.6.0-py2.py3-none-any.whl
File details
Details for the file deltas-0.6.0.tar.gz.
File metadata
- Download URL: deltas-0.6.0.tar.gz
- Upload date:
- Size: 25.4 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/1.15.0 pkginfo/1.6.1 requests/2.25.0 setuptools/40.6.3 requests-toolbelt/0.9.1 tqdm/4.52.0 CPython/3.5.6
File hashes
Algorithm | Hash digest
---|---
SHA256 | e14713f85c7ba8af590b39207c1346e1fea08c64c2357372b8a36aecf75e6bcb
MD5 | 96f6f0547fa492593d823084050641f3
BLAKE2b-256 | 42f27fafbd289f1024cefb3bc9e9038eb6096ac133743e95ce012b332c8641b2
File details
Details for the file deltas-0.6.0-py2.py3-none-any.whl.
File metadata
- Download URL: deltas-0.6.0-py2.py3-none-any.whl
- Upload date:
- Size: 34.6 kB
- Tags: Python 2, Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/3.2.0 pkginfo/1.6.1 requests/2.22.0 setuptools/44.0.0 requests-toolbelt/0.9.1 tqdm/4.15.0 CPython/3.8.5
File hashes
Algorithm | Hash digest
---|---
SHA256 | 8baa72c40b317aa6d4d2e09570f13c43b4a9616218df94f2442d5259aafc1f1d
MD5 | 577b1bb5c7488fe16b408fef97567a71
BLAKE2b-256 | 7eed70bd8fa4d45166a397ac550dadd76177bf33b6ff2777a265c96117fdad1d
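To verify a downloaded file against the digests above, a minimal sketch using Python's standard hashlib (the local filename is an assumption about where the file was saved):

import hashlib

# Hypothetical local path to the downloaded source distribution.
path = "deltas-0.6.0.tar.gz"

with open(path, "rb") as f:
    digest = hashlib.sha256(f.read()).hexdigest()

# Compare against the SHA256 digest published in the table above.
expected = "e14713f85c7ba8af590b39207c1346e1fea08c64c2357372b8a36aecf75e6bcb"
print("OK" if digest == expected else "MISMATCH")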